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We consider the problem of speaker diarization, the problem of segment- 
ing an audio recording of a meeting into temporal segments corresponding to 
individual speakers. The problem is rendered particularly difficult by the fact 
that we are not allowed to assume knowledge of the number of people partic- 
ipating in the meeting. To address this problem, we take a Bayesian nonpara- 
metric approach to speaker diarization that builds on the hierarchical Dirichlet 
process hidden Markov model (HDP-HMM) of Teh et al. (2006). Although 
the basic HDP-HMM tends to over-segment the audio data — creating redun- 
dant states and rapidly switching among them — we describe an augmented 
HDP-HMM that provides effective control over the switching rate. We also 
show that this augmentation makes it possible to treat emission distributions 
nonparametrically. To scale the resulting architecture to realistic diarization 
problems, we develop a sampling algorithm that employs a truncated approx- 
imation of the Dirichlet process to jointly resample the full state sequence, 
greatly improving mixing rates. Working with a benchmark NIST data set, 
we show that our Bayesian nonparametric architecture yields state-of-the-art 
speaker diarization results. 



1. Introduction. A recurring problem in many areas of information technol- 
ogy is that of segmenting a waveform into a set of time intervals that have a useful 
interpretation in some underlying domain. In this article we focus on a particu- 
lar instance of this problem, namely the problem of speaker diarization. In speaker 
diarization, an audio recording is made of a meeting involving multiple human par- 
ticipants and the problem is to segment the recording into time intervals associated 
with individual speakers (Wooters and Huijbregts, 2007). This segmentation is to 
be carried out without a priori knowledge of the number of speakers involved in 
the meeting; moreover, we do not assume that we have a priori knowledge of the 
speech patterns of particular individuals. 

Our approach to the speaker diarization problem is built on the framework of 
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hidden Markov models (HMMs), which have been a major success story not only in 
speech technology but also in many other fields involving complex sequential data, 
including genomics, structural biology, machine translation, cryptanalysis and fi- 
nance. An alternative to HMMs in the speaker diarization setting would be to treat 
the problem as a changepoint detection problem, but a key aspect of speaker di- 
arization is that speech data from a single individual generally recurs in multiple 
disjoint intervals. This suggests a Markovian framework in which the model tran- 
sitions among states that are associated with the different speakers. 

An apparent disadvantage of the HMM framework, however, is that classical 
treatments of the HMM generally require the number of states to be fixed a priori. 
While standard parametric model selection methods can be adapted to the HMM, 
there is little understanding of the strengths and weaknesses of such methods in 
this setting, and practical applications of HMMs generally fix the number of states 
using ad hoc approaches. It is not clear how to adapt HMMs to the diarization 
problem where the number of speakers is unknown. 

Building on work of Beal et al. (2002), Teh et al. (2006) presented a Bayesian 
nonparametric version of the HMM in which a stochastic process — the hierarchi- 
cal Dirichlet process (HDP) — defines a prior distribution on transition matrices 
over countably infinite state spaces. The resulting HDP-HMM is amenable to full 
Bayesian posterior inference over the number of states in the model. Moreover, this 
posterior distribution can be integrated over when making predictions, effectively 
averaging over models of varying complexity. The HDP-HMM has shown promise 
in a variety of applied problems, including visual scene recognition (Kivinen et al., 
2007), music synthesis (Hoffman et al., 2008), and the modeling of genetic recom- 
bination (Xing and Sohn, 2007) and gene expression (Beal and Krishnamurthy, 2006). 

While the HDP-HMM seems like a natural fit to the speaker diarization problem 
given its structural flexibility, as we show in Section 8, the HDP-HMM does not 
yield state-of-the-art performance in the speaker diarization setting. The problem 
is that the HDP-HMM inadequately models the temporal persistence of states. This 
problem arises in classical finite HMMs as well, where semi-Markovian models are 
often proposed as solutions. However, the problem is exacerbated in the nonpara- 
metric setting, in which the Bayesian bias towards simpler models is insufficient 
to prevent the HDP-HMM from giving high posterior probability to models with 
unrealistically rapid switching. This is demonstrated in Fig. 1, where we see that 
the HDP-HMM sampling algorithm creates redundant states and rapidly switches 
among them. (The figure also displays results from the augmented HDP-HMM — 
the "sticky HDP-HMM" that we describe in this paper.) The tendency to create 
redundant states is not necessarily a problem in settings in which model averaging 
is the goal. For speaker diarization, however, it is critical to infer the number of 
speakers as well as the transitions among speakers. 
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FIG 1. (a) Multinomial obser\>ation sequence; (b) true state sequence; (c)-(d) estimated state se- 
quence after 30,000 Gibbs iterations for the original and sticky HDP-HMM, respectively, with er- 
rors indicated in red. Without an extra self-transition bias, the HDP-HMM rapidly transitions among 
redundant states. 



Thus, one of our major goals in this paper is to provide a general solution to the 
problem of state persistence in HDP-HMMs. Our approach is easily stated — we 
simply augment the HDP-HMM to include a parameter for self-transition bias, and 
place a separate prior on this parameter. The challenge is to execute this idea coher- 
ently in a Bayesian nonparametric framework. Earlier papers have also proposed 
self-transition parameters for HMMs with infinite state spaces (Beal et al., 2002; 
Xing and Sohn, 2007), but did not formulate general solutions that integrate fully 
with Bayesian nonparametric inference. 

Another goal of the current paper is to develop a more fully nonparametric 
version of the HDP-HMM in which not only the transition distribution but also 
the emission distribution (the conditional distribution of observations given states) 
is treated nonparametrically. This is again motivated by the speaker diarization 
problem — in classical applications of HMMs to speech recognition problems it is 
often the case that emission distributions are found to be multimodal, and high- 
performance HMMs generally use finite Gaussian mixtures as emission distribu- 
tions (Gales and Young, 2008). In the nonparametric setting it is natural to replace 
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these finite mixtures with Dirichlet process mixtures. Unfortunately, this idea is 
not viable in practice, because of the tendency of the HDP-HMM to rapidly switch 
between redundant states. As we show, however, by incorporating an additional 
self-transition bias it is possible to make use of Dirichlet process mixtures for the 
emission distributions. 

An important reason for the popularity of the classical HMM is its computa- 
tional tractability. In particular, marginal probabilities and samples can be obtained 
from the HMM via an efficient dynamic programming algorithm known as the 
forward-backward algorithm (Rabiner, 1989). We show that this algorithm also 
plays an important role in computationally efficient inference for our generalized 
HDP-HMM. Using a truncated approximation to the full Bayesian nonparametric 
model, we develop a blocked Gibbs sampler which leverages forward-backward 
recursions to jointly resample the state and emission assignments for all observa- 
tions. 

The paper is organized as follows. In Section 2, we begin by summarizing re- 
lated prior work on the speaker diarization task and analyzing the key characteris- 
tics of the dataset we examine in Section 8. In Section 3, we provide some basic 
background on Dirichlet processes. Then, in Section 4, we overview the hierarchi- 
cal Dirichlet process and, in Section 5, discuss how it applies to HMMs and can 
be extended to account for state persistence. An efficient Gibbs sampler is also de- 
scribed in this section. In Section 7, we treat the case of nonparametric emission 
distributions. We discuss our application to speaker diarization in Section 8. A list 
of notational conventions can be found in the Supplementary Material (Fox et al., 
2010). 

2. The Speaker Diarization Task. There is a vast literature on the speaker 
diarization task, and in this section we simply aim to provide an overview of the 
most common techniques. We refer the interested reader to Tranter and Reynolds 
(2006) for a more thorough exposition on the subject. 

Classical speaker diarization techniques typically employ a two-stage procedure 
that first segments the audio (or features thereof) using one of a variety of change- 
point algorithms. The inferred segments are then regrouped into a set of speaker 
labels via a clustering algorithm. For example, Reynolds and Torres-Carrasquillo 
(2004) propose a changepoint detection method based on the Bayesian Information 
Criterion (BIC). Specifically, a penalized likelihood ratio test is used to compare 
whether the data within a fixed window are better modeled via a single Gaussian 
or two Gaussians. The window gradually grows at each test until a changepoint is 
inferred, at which point the window is reinitialized at the inferred changepoint. An 
alternative changepoint detection technique, first proposed in Siegler et al. (1997), 
uses fixed length windows and computes the symmetric Kullback-Leibler (KL) 
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divergence between a pair of Gaussians each fit by the data in their respective 
windows. A post-processing step then sets the changepoints equal to the peaks 
of the computed KL that exceed a predetermined threshold. In order to group 
the inferred segments into a set of speaker labels, a common approach is to use 
hierarchical agglomerative clustering with a BIC stopping criterion, as proposed 
in Chen and Gopalakrishnam (1998). 

The simple two-stage approach outlined above suffers from the fact that errors 
made in the segmentation stage can degrade the performance of the subsequent 
clustering stage. A number of algorithms instead iterate between multiple stages 
of resegmentation (typically via Viterbi decoding) and clustering; for example, see 
Barras et al. (2004); Wooters et al. (2004). Iterative segmentation and clustering al- 
gorithms employing a Gaussian mixture model for each cluster (i.e., speaker), such 
as those proposed by Gauvain et al. (1998); Barras et al. (2004), have been shown 
to improve diarization performance. Overall, however, agglomerative clustering is 
extremely sensitive to the specified threshold for cluster merging, with different 
settings leading to either over- or under-clustering of the segments into speakers. 
The thresholds are typically set based on testing on an extensive training database. 

A number of more recent approaches have considered the problem of joint seg- 
mentation and clustering by employing HMMs to capture the repeated returns of 
speakers. To handle the fact that the state space is unknown, Meignier et al. (2000) 
introduces the use of an evolutive-HMM which is further developed in Meignier et al. 
(2001). The HMM is initialized to have one state and at each iteration a segment 
of speech is assumed to arise from an undetected speaker who is added to the 
model. The revised HMM is then used to resegment the audio, and this iterative 
procedure continues until the speaker labels have converged. An alternative HMM 
formulation is presented in Wooters and Huijbregts (2007). The data are initially 
split into K states, with K assumed to be larger than the number of true speakers, 
and the HMM states are iteratively merged according to a metric based on changes 
in BIC. At each iteration, Viterbi decoding is performed to resegment the features 
of the audio, and the inferred segments are used to fit a new HMM via expectation 
maximization (EM). Then, the BIC criterion is applied to decide whether to merge 
HMM states. The algorithm also includes HMM substates to impose minimum 
speaker durations. 

Our approach also seeks to jointly segment and cluster the audio into speaker- 
homogenous regions, as targeted by the HMM approaches of Meignier et al. (2001); 
Wooters and Huijbregts (2007), but within a Bayesian nonparametric framework 
that avoids relying on the heuristics employed by these previously proposed algo- 
rithms and allows for coherent Bayesian inference. 

The data set we consider in the experiments of Section 8 is a standard bench- 
mark data set distributed by NIST as part of the Rich Transcription 2004-2007 
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meeting recognition evaluations (NIST, 2007). The data set consists of 21 recorded 
meetings, each of which may have different sets of speakers both in number and 
identity. We use the first 19 Mel Frequency Cepstral Coefficients (MFCCs) 1 , com- 
puted over a 30ms window every 10ms, as a feature vector. After these features are 
computed, a speech/non-speech detector is run to identify and remove observations 
corresponding to non-speech. (Non-speech refers to time intervals in which nobody 
is speaking.) The pre-processing step of removing non-speech observations is im- 
portant in ensuring that the fitted acoustic models are not corrupted by non-speech 
information. 

When working with this dataset, we discovered that the high frequency con- 
tent of these features contained little discriminative information. Since minimum 
speaker durations are rarely less than 500ms, we chose to define the observations 
as averages over 250ms, non-overlapping blocks. This preprocessing stage also 
aids in achieving speaker dynamics at the correct granularity (as opposed to finer 
temporal scale features leading to inferring within-speaker dynamics in addition to 
global speaker changes.) In Fig. 2, we plot a histogram the speaker durations of 
our preprocessed features based on the ground truth labels provided for each of the 
21 meetings. From this plot, we see that a geometric duration distribution fits this 
data reasonably well. This motivates our approach of simply increasing the prior 
probability of self-transitions within a Markov framework rather than moving to 
the more complicated semi-Markov formulation of speaker transitions. 

Another key feature of the speaker diarization data is the fact that the speaker 
specific emissions are not well approximated by a single Gaussian; see Fig. 3. This 
observation has led many researchers to consider a mixture-of-Gaussians speaker 
model, as previously described. As demonstrated in Section 8, we show that achiev- 
ing state-of-the-art performance within our framework also relies on allowing for 
non-Gaussian emissions. 

3. Dirichlet Processes. A Dirichlet process (DP) is a distribution on proba- 
bility measures on a measurable space 0. This stochastic process is uniquely de- 
fined by a base measure H on © and a concentration parameter 7; we denote it by 
DP(7, H). Consider a random probability measure Go ~ DP(j,H). The DP is 
formally defined by the property that for any finite partition! -^1 > • • • 1 Ar:} of 0, 

(3.1) (G0OA1), . . • , Go(A K )) I 7, H ~ Dir( 7 ir(Ai), . . . , jH(A K )). 

'Mel-frequency cepstral coefficients (MFCCs) comprise a representation of the short-term power 
spectrum of a sound on the mel scale (a nonlinear scale of frequency based on the human auditory 
system response). Specifically, the computation of an MFCC typically involves (i) taking the Fourier 
transform of a windowed excerpt of a signal, (ii) mapping the log powers of the obtained spectrum 
onto the mel scale, and (iii) performing a discrete cosine transform of the mel log powers. The 
MFCCs are the amplitudes of the resulting spectrum. 
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FIG 2. Normalized histogram of 
speaker durations of the pre- 
processed audio features from 
the 21 meetings in the NIST 
database. A Geom(0.1) density 
is also shown for comparison. 



FIG 3. Contour plots of the best fit Gaussian ( top) and kernel 
density estimate ( bottom ) for the top two principal compo- 
nents of the audio features associated with each of the four 
speakers present in the AM1J20041210-1052 meeting. With- 
out capturing the non-Gaussianity of the speaker-specific 
emissions, the speakers are challenging to identify. 



That is, the measure of a random probability distribution Go ~ DP (7, H) on every 
finite partition of follows a finite-dimensional Dirichlet distribution (Ferguson, 
1973). A more constructive definition of the DP was given by Sethuraman (1994). 
Consider a probability mass function (pmf) {(3k}kLi o n a countably infinite set, 
where the discrete probabilities are defined as follows: 

v k I 7 ~ Beta(l,7) k = 1,2, ... 
fc-i 

(3.2) h = v k J[{l-v i ) fc = 1,2, .... 

e=i 

In effect, we have divided a unit-length stick into lengths given by the weights /3 k '- 
the k th weight is a random proportion v k of the remaining stick after the previous 
(k — 1) weights have been defined. This stick-breaking construction is generally 
denoted by /3 ~ GEM (7). With probability one, a random draw Go ~ DP (7, H) 
can be expressed as 

00 

(3.3) G = J2^Se k 6 k \H~H, A; = 1,2,..., 

k=l 

where 5$ denotes a unit-mass measure concentrated at 9 and where are drawn 
independently from H. From this definition, we see that the DP actually defines 
a distribution over discrete probability measures. The stick-breaking construction 
also gives us insight into how the concentration parameter 7 controls the relative 
magnitude of the mixture weights /3k, and thus determines the model complexity 
in terms of the expected number of components with significant probability mass. 

The DP has a number of properties which make inference based on this nonpara- 
metric prior computationally tractable. Consider a set of observations {6>-} with 



8 



E. FOX ET AL. 



Q'l ~ Go- Because probability measures drawn from a DP are discrete, there is 
a strictly positive probability of multiple observations 6^ taking identical values 
within the set {6k}, with df. denned as in Equation (3.3). For each value let Z{ 
be an indicator random variable that picks out the unique value k such that Q\ = 6 Zi . 
Blackwell and MacQueen (1973) introduced a Polya urn representation of the 6'f 



7 



i-i 

1 ^ 7 + i - 1 i 



7 + i — 1 r— f 7 + 

K 

7 tt , sr^ n k 



(3.4) = - H + V 

V 7 + i-l ^7 + *-l 

implying the following predictive distribution for the indicator random variables: 

(3.5) 

1 K 

p(ztv+i = z | 21,...,^, 7) = tt- — 6(z,K + 1) + — — Vn fc 5(z,A;). 

jV + 7 N + 7 

Here, = X^^i ^) i s tne number of indicator random variables taking the 
value k, and K + 1 is a previously unseen value. We use the notation 5(z, k) to 
indicate the discrete Kronecker delta. This representation can be used to sample 
observations from a DP without explicitly constructing the countably infinite ran- 
dom probability measure Go ~ DP (7, H). 

The distribution on partitions induced by the sequence of conditional distribu- 
tions in Equation (3.5) is commonly referred to as the Chinese restaurant process. 
The analogy, which is useful in developing various generalizations of the Dirichlet 
process we consider in this paper, is as follows. Take i to be a customer entering a 
restaurant with infinitely many tables, each serving a unique dish Of.. Each arriving 
customer chooses a table, indicated by Z{, in proportion to how many customers are 
currently sitting at that table. With some positive probability proportional to 7, the 
customer starts a new, previously unoccupied table K + 1. The Chinese restaurant 
process captures the fact that the DP has a clustering property such that multiple 
draws from the random measure take the same value. 

The DP is commonly used as a prior on the parameters of a mixture model with a 
random number of components. Such a model is called a Dirichlet process mixture 
model and is depicted as a graphical model in Fig. 4(a)-(b). To generate observa- 
tions, we choose 9^ ~ Go and ~ f° r an indexed family of distributions 
F(-). This sampling process is also often described in terms of the indicator ran- 
dom variables Zi\ in particular, we have zi ~ /3 and y» ~ F(Q Zi ). The parameter 
with which an observation is associated implicitly partitions or clusters the data. 
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In addition, the Chinese restaurant process representation indicates that the DP 
provides a prior that makes it more likely to associate an observation with a param- 
eter to which other observations have already been associated. This reinforcement 
property is essential for inferring finite, compact mixture models. It can be shown 
under mild conditions that if the data were generated by a finite mixture, then the 
DP posterior is guaranteed to converge (in distribution) to that finite set of mixture 
parameters (Ishwaran and Zarepour, 2002a). 

4. Hierarchical Dirichlet Processes. In the following section we describe 
how ideas based on the Dirichlet process have been used to develop a Bayesian 
nonparametric approach to hidden Markov modeling in which the number of states 
is unknown a priori. To develop this nonparametric version of the HMM the Dirich- 
let process does not suffice; rather, it is necessary to develop a hierarchical Bayesian 
model involving a tied collection of Dirichlet processes. This has been done by 
Teh et al. (2006) whose hierarchical Dirichlet process (HDP) we describe in this 
section. The HDP is applicable to general problems involving related groups of 
data, each of which can be modeled using a DP, and we begin by describing the 
HDP at this level of generality, subsequently specializing to the HMM. 

To describe the HDP, suppose there are J groups of data and let {yji , . . . , VjN-} 
denote the set of observations in group j. Assume that there are a collection of DP 
mixture models underlying the observations in these groups: 

G j = XXi KjtSe* t I a ~ GEM(a) j = 1, . . . , J 

9* t | Go ~G t = 1,2, ... 

(4.1) 

pjiiGj-Gj //;. J 1 / 

i = l,...,N r 

We wish to tie the DP mixtures across the different groups such that atoms that 
underly the data in group j can be used in group j'. The problem is that if Go 
is absolutely continuous with respect to Lebesgue measure (as it generally is for 
continuous parameters) then the atoms in Gj will be distinct from those in Gj> with 
probability one. The solution to this problem is to let Go itself be a draw from a 
DP: 

(A7 , G = Z?=iPk5e k /3| 7 ~GEM( 7 ) 

^ O k \H,\~H(\) fc = l,2,.... 

In this hierarchical model, Go is atomic and random. Letting Go be a base measure 
for the draw Gj ~ DP(a,Go) implies that only these atoms can appear in Gj. 
Thus atoms can be shared among the collection of random measures {Gj}. The 
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FIG 4. Dirichlet process (left) and hierarchical Dirichlet process (right) mixture models repre- 
sented in two different ways as graphical models, (a) Indicator variable representation in which 
/3|7 ~ GEM(i), 9 k \H,X ~ H(X), Zi\/3 ~ /3, and yi\{0 h }kL u X ~ F(0 Xi ). (b) Alternative rep- 
resentation with Go\ce, H ~ DP(a,H), 6>£|Go ~ Go, arcdyi|#i ~ F(9' i ). (c) Indicator variable 
representation in which /3|7 ~ GEM(i), irk\a,f3 ~ DP(q,/3), #fc|i?, A ~ H(X), Zji\tTj ~ 7Tj, 
cmii 26t|{^fc}fcLi, «ji ~ F(6 Zji ). (d) Alternative representation with Go\y,H ~ DP(7,_ff), 



GjIGo ~ DP(a,G ) 



Gj, and yji | 



F(9ji). The "plate" notation is used to com- 



pactly represent replication (Teh etai, 2006). 



HDP model is depicted graphically in two different ways in Fig. 4(c) and Fig. 4(d). 

Teh et al. (2006) have also described the marginal probabilities obtained from 
integrating over the random measures Go and {Gj}. They show that these marginals 
can be described in terms of a Chinese restaurant franchise (CRF) that is an ana- 
log of the Chinese restaurant process. The CRF is comprised of J restaurants, each 
corresponding to an HDP group, and an infinite buffet line of dishes common to 
all restaurants. The process of seating customers at tables, however, is restaurant 
specific. Each customer is pre-assigned to a given restaurant determined by that 
customer's group j. Upon entering the j th restaurant in the CRF, customer sits 
at currently occupied tables tji with probability proportional to the number of cur- 
rently seated customers, or starts a new table Tj + 1 with probability proportional to 
a. The first customer to sit at a table goes to the buffet line and picks a dish kjt for 
their table, choosing the dish with probability proportional to the number of times 
that dish has been picked previously, or ordering a new dish 9k+i with probability 
proportional to 7. The intuition behind this predictive distribution is that integrating 
over the global dish probabilities /3 results in customers making decisions based on 
the observed popularity of the dishes throughout the entire franchise. See the Sup- 
plementary Material for further details (Fox et al., 2010). 

Recalling Equation (4.1)-(4.2), since each distribution Gj is drawn from a DP 
with a discrete base measure Go, multiple 9* t may take an identical value 9^ for 
multiple unique values of t. As we see in the Supplemental Material (Fox et al., 
2010), this corresponds to multiple tables in the same restaurant being served the 
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same dish. We can write Gj as a function of the unique dishes: 



oo 



(4.3) Gj = J2^kS ei 



TTj | a,/9~ DP(a, /3) , 6 k \H ~ H, 



k=l 



where Ttj now defines a restaurant-specific distribution over dishes served rather 
than over tables, with 



Let Zji be the indicator random variable for the unique dish selected by obser- 
vation yji. A third equivalent representation for the generative model is in terms of 
these indicator random variables: 

(4.5) TTj I a,P ~ DP(q, (3) \ ttj ~ yr,- | {<9 fc }, zyi ~ F{0 Zji ), 

and is shown in Fig. 4(c). 

5. The Sticky HDP-HMM. Recall that the hidden Markov model, or HMM, 
is a class of doubly stochastic processes based on an underlying, discrete-valued 
state sequence, which is modeled as Markovian (Rabiner, 1989). Let Zt denote the 
state of the Markov chain at time t and ttj the state-specific transition distribu- 
tion for state j. Then, the Markovian structure on the state sequence dictates that 
zt ~ Kzt-i- The observations, yt, are conditionally independent given this state 
sequence, with y t ~ F(6 Zt ) for some fixed distribution F(-). 

The HDP can be used to develop an HMM with an infinite state space — the 
HDP-HMM (Teh et al., 2006). In the speaker diarization task, each state constitutes 
a different speaker and our goal in moving to an infinite state space is to remove 
upper bounds on the total number of speakers present. Conceptually, we envision a 
doubly-infinite transition matrix, with each row corresponding to a Chinese restau- 
rant. That is, the groups in the HDP formalism here correspond to states, and each 
Chinese restaurant defines a distribution on next states. The CRF links these next- 
state distributions. Thus, in this application of the HDP, the group-specific distribu- 
tion, TTj, is a state-specific transition distribution and, due to the infinite state space, 
there are infinitely many such groups. Since Zt ~ 7r« t _ i; we see that zt-i indexes 
the group to which y t is assigned (i.e., all observations with z t -i = j are assigned 
to group j). Just as with the HMM, the current state zt then indexes the parameter 
6 Zt used to generate observation y t (see Fig. 5(a)). 

By defining ttj ~ DP(a, /?), the HDP prior encourages states to have similar 
transition distributions (E[itjk \ P] = /3k)- However, it does not differentiate self- 
transitions from moves between different states. When modeling data with state 



(4.4) 




12 



E. FOX ET AL. 




FIG 5. (a) Graphical representation of the sticky HDP-HMM. The state evolves as 
Zt+i\{iTk}kLi,Zt ~ 7Tz t . where 7Vk\a, k, /3 ~ DP(a + K, (a/3 + n5k)/(a + n)) and /3\j ~ 
GEM(j), and observations are generated as yt\{9k}hLu z t ~ ^X^t)- ^ e original HDP-HMM 
has k — 0. (b) Sticky HDP-HMM with DP emissions, where St indexes the state-specific mix- 
ture component generating observation yt- The DP prior dictates that St |{'0fc}fe^=i j z t ~ V'zt / c "' 

*0yt|{0fcj}£j=i,zt,st ~ F{0 zust ) 



GEM(a). The j th Gaussian component of the k th mixture density is parameterized by 8k,j 



persistence, the flexible nature of the HDP-HMM prior allows for state sequences 
with unrealistically fast dynamics to have large posterior probability. For example, 
with multinomial emissions, a good explanation of the data is to divide different 
observation values into unique states and then rapidly switch between them (see 
Fig. 1). In such cases, many models with redundant states may have large poste- 
rior probability, thus impeding our ability to identify a compact dynamical model 
which best explains the observations. The problem is compounded by the fact that 
once this alternating pattern has been instantiated by the sampler, its persistence 
is then reinforced by the properties of the Chinese restaurant franchise, thus slow- 
ing mixing rates. Furthermore, this fragmentation of data into redundant states can 
reduce predictive performance, as is discussed in Section 6. In many applications, 
one would like to be able to incorporate prior knowledge that slow, smoothly vary- 
ing dynamics are more likely. 

To address these issues, we propose to instead model the transition distributions 
TTj as follows: 



P | 7 ~ GEM (7) 
(5.1) Tij I a, k, (5 ~ DP I a + k, 



a/3 + n5j 
a + n 

Here, (a/3 + nSj) indicates that an amount k > is added to the j th component of 
a/3. Informally, what we are doing is increasing the expected probability of self- 
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transition by an amount proportional to k: 

(5.2) E[7r jk \P,K] = ■ . 

a + k 

More formally, over a finite partition (Z\, . . . , Zk) of the positive integers Z+, the 
prior on the measure 7T, adds an amount k only to the arbitrarily small partition 
containing j, corresponding to a self-transition. That is, 

(5.3) 

(ttjCZi), . . .,Trj(Z K )) | q,/3 ~ Dir(a/3(Zi) + k^(Zi), . . .,a(3(Z K ) + KSj{Z K )). 

When k = the original HDP-HMM of Teh et al. (2006) is recovered. Because 
positive k values increase the prior probability E[irjj | /3] of self-transitions, we re- 
fer to this extension as the sticky HDP-HMM. See Fig. 5(a). Note that this formula- 
tion assumes that the stickiness of each HMM state is the same a priori. The param- 
eter could be made state-dependent through a hierarchical model that ties together 
a collection of state-specific sticky parameters. However, such state-specific stick- 
iness is unnecessary for the speaker diarization task at hand since each speaker is 
assumed to have similar expected durations. Differences between speaker-specific 
transitions become more distinguished in the posterior. 

The k parameter is reminiscent of the self-transition bias parameter of the infi- 
nite HMM, an urn model for hidden Markov models on infinite state spaces that 
predated the HDP-HMM (Beal et al, 2002). The connection between the (sticky) 
HDP-HMM and the infinite HMM is analogous to that between the DP and the 
Polya urn; in both cases the latter is obtained by integrating out the random mea- 
sures in the former. In particular, the infinite HMM employs a two-level urn model 
in which the top-level urn places a probability on transitions to existing states in 
proportion to how many times these transitions have been seen, with an added 
bias for a self-transition even if this has not previously occurred. With some re- 
maining probability an oracle is called, representing the second-level urn. This 
oracle chooses an existing state in proportion to how many times the oracle pre- 
viously chose that state, regardless of the state transition involved, or chooses a 
previously unvisited state. The original HDP-HMM provides an interpretation of 
this urn model in terms of an underlying collection of linked random probability 
measures, however without the self-transition parameter. In addition to the concep- 
tual clarity provided by the random measure formalism, the HDP-HMM has the 
practical advantage that it makes it possible to use standard MCMC algorithms for 
posterior inference; working within the urn model formulation Beal et al. (2002) 
needed to resort to a heuristic approximation to a Gibbs sampler. The sticky HDP- 
HMM, an early version of which was presented in Fox et al. (2008), restores the 
self-transition parameter of the infinite HMM to this class of models, doing so in a 
way that integrates with a full Bayesian nonparametric specification. 
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As with the DP, this specification in terms of random measures yields various 
interesting characterizations of marginal probabilities. In particular, as described in 
the Supplemental Material (Fox et al., 2010), the partitioning structure induced by 
the sticky HDP-HMM has an interpretation as an extension of the Chinese restau- 
rant franchise (CRF) which we refer to as a CRF with loyal customers. Here, 
each restaurant in the franchise has a specialty dish with the same index as that 
of the restaurant. Although this dish is served elsewhere, it is more popular in the 
dish's namesake restaurant. Recall that while customers in the CRF of the HDP 
are pre-partitioned into restaurants based on the fixed group assignments, in the 
HDP-HMM the value of the state Zt determines the group assignment (and thus 
restaurant) of customer yt+i. The increased popularity of the house specialty dish 
(determined by the sticky parameter k) implies that children are more likely to eat 
in the same restaurant as their parent (z t = zt-i = j) and, in turn, more likely 
to eat the restaurant's specialty dish (zt+i = j). This develops family loyalty to a 
given restaurant in the franchise. However, if the parent chooses a dish other than 
the house specialty (z t = k, k ^ j), the child will then go to the restaurant where 
this dish is the specialty and will in turn be more likely to eat this dish, too. One 
might say that for the sticky HDP-HMM, children have similar tastebuds to their 
parents and will always go to the restaurant that prepares their parent's dish best. 
Often, this keeps many generations eating in the same restaurant. 

Throughout the remainder of the paper, we use the following notational conven- 
tions. Given a random sequence {x\, X2, • • • , xt}, we use the shorthand x\-t to de- 
note the sequence {xi,X2, ■ ■ ■ , xt}and x\ t to denote the set {x\, . . . ,xt-\,x t +i, . . . ,xt}. 
Also, for random variables with double subindices, such as x aia2 , we will use x 
to denote the entire set of such random variables, {x aia2 , Vai, Va2}, and the short- 
hand notation x ai . = J2 a2 x a ia2 , x. a2 = ]T ai x aia2 , and x.. = J2 ai Sa 2 x ^i^- 

5.1. Sampling via Direct Assignments. In this section we present an inference 
algorithm for the sticky HDP-HMM of Section 5 and Fig. 5(a) that is a modified 
version of the direct assignment Rao-Blackwellized Gibbs sampler of Teh et al. 
(2006). This sampler circumvents the complicated bookkeeping of the CRF by 
sampling indicator random variables directly. The resulting sticky HDP-HMM di- 
rect assignment Gibbs sampler is outlined in Algorithm 1 of the Supplementary 
Material (Fox et al., 2010), which also contains the full derivations of this sampler. 

The basic idea is that we marginalize over the infinite set of state-specific transi- 
tion distributions -k^ and parameters 9^, and sequentially sample the state zt given 
all other state assignments Z\+, the observations t/u, and the global transition dis- 
tribution /3. A variant of the Chinese restaurant process gives us the prior proba- 
bility of an assignment of z t to a value k based on how many times we have seen 
other transitions from the previous state value z t -\ to k and k to the next state value 
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Zt+i- As derived in the Supplementary Material (Fox et al., 2010), this conditional 
distribution is dependent upon whether either or both of the transitions zt-i to k 
and k to Zf+i correspond to a self-transition, most strongly when k > 0. The prior 
probability of an assignment of zt to state k is then weighted by the likelihood of 
the observation y t given all other observations assigned to state k. 

Given a sample of the state sequence z\-jt, we can represent the posterior distri- 
bution of the global transition distribution /3 via a set of auxiliary random variables 
ifijk, rrijk, and Wjt, which correspond to the j th restaurant-specific set of table 
counts associated with the CRF with loyal customers described in the Supplemen- 
tal Material (Fox et al., 2010). The Gibbs sampler iterates between sequential sam- 
pling of the state Zt for each individual value of t given (3 and z\ t ; sampling of the 
auxiliary variables fhjk, rrijk, and Wj t given z\-t and /3; and sampling of (3 given 
these auxiliary variables. 

The direct assignment sampler is initialized by sampling the hyperparameters 
and (3 from their respective priors and then sequentially sampling each z t as if the 
associated y t was the last observation. That is, we first sample z\ given y\, (3, and 
the hyperparameters. We then sample Z2 given z\, y± : 2, (3, and the hyperparam- 
eters, and so on. Based on the resulting sample of z\-t, we resample (3 and the 
hyperparameters. From then on, the sampler continues with the normal procedure 
of conditioning on z\ t when resampling z t . 

5.2. Blocked Sampling of State Sequences. The HDP-HMM sequential, direct 
assignment sampler of Section 5.1 can exhibit slow mixing rates since global state 
sequence changes are forced to occur coordinate by coordinate. This phenomenon 
is explored in Scott (2002) for the finite HMM. Although the sticky HDP-HMM 
reduces the posterior uncertainty caused by fast state-switching explanations of the 
data, the self-transition bias can cause two continuous and temporally separated 
sets of observations of a given state to be grouped into two states. See Fig. 6(b) for 
an example. If this occurs, the high probability of self-transition makes it challeng- 
ing for the sequential sampler to group those two examples into a single state. 

We thus propose using a variant of the HMM forward-backward procedure 
(Rabiner, 1989) to harness the Markovian structure and jointly sample the state 
sequence z\-t given the observations yi-r, transition probabilities tt^, and param- 
eters 9k- There are two main mechanisms for sampling in an uncollapsed HDP 
model (i.e., one that instantiates the parameters ir^ and 9k)'- one is to employ slice 
sampling while the other is to consider a truncated approximation to the HDP. 
For the HDP-HMM, a slice sampler, referred to as beam sampling, was recently 
developed (Van Gael et al., 2008). This sampler harnesses the efficiencies of the 
forward-backward algorithm without having to fix a truncation level for the HDP. 
However, as we elaborate upon in Section 6.1, this sampler suffers from slower 
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mixing rates than the block sampler we propose, which utilizes a fixed-order trun- 
cation of the HDP-HMM. Although a fixed truncation reduces our model to a 
parametric Bayesian HMM, the specific hierarchical prior induced by a trunca- 
tion of the fully nonparametric HDP significantly improves upon classical para- 
metric Bayesian HMMs. Specifically, a fixed degree L truncation encourages each 
transition distribution to be sparse over the set of L possible HMM states, and si- 
multaneously encourages transitions from different states to have similar sparsity 
structures. That is, the truncated HDP prior leads to a shared sparse subset of the 
L possible states. See Section 6.3 for a comparison with standard parametric mod- 
eling. 

There are multiple methods of approximating the countably infinite transition 
distributions via truncations. One approach is to terminate the stick-breaking con- 
struction after some portion of the stick has already been broken and assign the 
remaining weight to a single component. This approximation is referred to as the 
truncated Dirichlet process. Another method is to consider the degree L weak limit 
approximation to the DP (Ishwaran and Zarepour, 2002b), 

(5.4) GEM L (a) = Dir(a/L, . . . , a/L), 

where L is a number that exceeds the total number of expected HMM states. Both 
of these approximations, which are presented in Ishwaran and Zarepour (2002b, 
2000), encourage the learning of models with fewer than L components while al- 
lowing the generation of new components, upper bounded by L, as new data are 
observed. We choose to use the second approximation because of its simplicity 
and computational efficiency. The two choices of approximations are compared 
in Kurihara et al. (2007), and little to no practical differences are found. Using a 
weak limit approximation to the Dirichlet process prior on /3 (i.e., employing a 
finite Dirichlet prior) induces a finite Dirichlet prior on irf 

(5.5) /3| 7 ~Dir( 7 /L,..., 7 /L) 

(5.6) TTj I a, /3 ~ Dir(a$L, . . . , ol^£). 

As L — > oo, this model converges in distribution to the HDP mixture model 
(Teh et al., 2006). 

The Gibbs sampler using blocked resampling of z\-t is derived in the Supple- 
mentary Material (Fox et al., 2010); an outline of the resulting algorithm is also 
presented (see Algorithm 3). A similar sampler has been used for inference in HDP 
hidden Markov trees (Kivinen et al., 2007). However, this work did not consider the 
complications introduced by multimodal emissions, which we explore in Section 7. 

The blocked sampler is initialized by drawing L parameters 6^ from the base 
measure, /3 from its L-dimensional symmetric Dirichlet prior, and the L transi- 
tion distributions TTk from the induced L-dimensional Dirichlet prior specified in 
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Equation (5.6). The hyperparameters are also drawn from the prior. Based on the 
sampled parameters and transition distributions, one can block sample z\-t and 
proceed as in Algorithm 3. 

5.3. Hyperparameters. We treat the hyperparameters in the sticky HDP-HMM 
as unknown quantities and perform full Bayesian inference over these quantities. 
This emphasizes the role of the data in determining the number of occupied states 
and the degree of self-transition bias. Our derivation of sampling updates for the 
hyperparameters of the sticky HDP-HMM is presented in the Supplementary Mate- 
rial (Fox et al., 2010); it roughly follows that of the original HDP-HMM (Teh et al., 
2006). A key step which simplifies our inference procedure is to note that since we 
have the deterministic relationships 

a = (1 — p)(a + k) 
(5.7) k = p(a + k), 

we can treat p and a + k as our hyperparameters and sample these values instead 
of sampling a and k directly. 

6. Experiments with Synthetic Data. In this section, we explore the perfor- 
mance of the sticky HDP-HMM relative to the original model (i.e., the model with 
k = 0) in a series of experiments with synthetic data. We judge performance ac- 
cording to two metrics: our ability to accurately segment the data according to the 
underlying state sequence, and the predictive likelihood of held-out data under the 
inferred model. We additionally assess the improvements in mixing rate achieved 
by using the blocked sampler of Section 5.2. 

6. 1. Gaussian Emissions. We begin our analysis of the sticky HDP-HMM per- 
formance by examining a set of simulated data generated from an HMM with Gaus- 
sian emissions. The first dataset is generated from an HMM with a high probability 
of self-transition. Here, we aim to show that the original HDP-HMM inadequately 
captures state persistence. The second dataset is from an HMM with a high prob- 
ability of leaving the current state. In this scenario, our goal is to demonstrate that 
the sticky HDP-HMM is still able to capture rapid dynamics by inferring a small 
probability of self-transition. 

For all of the experiments with simulated data, we used weakly informative hy- 
perpriors. We placed a Gamma(l, 0.01) prior on the concentration parameters 7 
and (a + k). The self-transition proportion parameter p was given a Beta(10, 1) 
prior. The parameters of the base measure were set from the data, as will be de- 
scribed for each scenario. 
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State Persistence. The data for the high persistence case were generated from a 
three-state HMM with a 0.98 probability of self-transition and equal probability 
of transitions to the other two states. The observation and true state sequences for 
the state persistence scenario are shown in Fig. 6(a). We placed a normal inverse- 
Wishart prior on the space of mean and variance parameters and set the hyperpa- 
rameters as: 0.01 pseudocounts, mean equal to the empirical mean, three degrees 
of freedom, and scale matrix equal to 0.75 times the empirical variance. We used 
this conjugate base measure so that we may directly compare the performance of 
the blocked and direct assignment samplers. For the blocked sampler, we used a 
truncation level of L = 20. 

In Fig. 6(d)-(h), we plot the 10 th , 50 th , and 90 th quantiles of the Hamming 
distance between the true and estimated state sequences over the 1000 Gibbs iter- 
ations using the direct assignment and blocked samplers on the sticky and original 
HDP-HMM models. To calculate the Hamming distance, we used the Munkres al- 
gorithm (Munkres, 1957) to map the randomly chosen indices of the estimated state 
sequence to the set of indices that maximize the overlap with the true sequence. 

From these plots, we see that the burn-in rate of the blocked sampler using the 
sticky HDP-HMM is significantly faster than that of any other sampler-model com- 
bination. As expected, the sticky HDP-HMM with the sequential, direct assignment 
sampler gets stuck in state sequence assignments from which it is hard to move 
away, as conveyed by the flatness of the Hamming error versus iteration number 
plot in Fig. 6(g). For example, the estimated state sequence of Fig. 6(b) might have 
similar parameters associated with states 3, 7, 10 and 11 so that the likelihood is 
in essence the same as if these states were grouped, but this sequence has a large 
error in terms of Hamming distance and it would take many iterations to move 
away from this assignment. Incorporating the blocked sampler with the original 
HDP-HMM improves the Hamming distance performance relative to the sequen- 
tial, direct assignment sampler for both the original and sticky HDP-HMM; how- 
ever, the burn-in rate is still substantially slower than that of the blocked sampler 
on the sticky model. 

As discussed earlier, a beam sampling algorithm (Van Gael et al., 2008) has 
been proposed which adapts slice sampling methods (Robert, 2007) to the HDP- 
HMM. This approach uses a set of auxiliary slice variables, one for each observa- 
tion, to effectively truncate the number of state transitions that must be considered 
at every Gibbs sampling iteration. Dynamic programming methods can then be 
used to jointly resample state assignments. The beam sampler was inspired by a re- 
lated approach for DP mixture models (Walker, 2007), which is conceptually sim- 
ilar to retrospective sampling methods (Papaspiliopoulos and Roberts, 2008). In 
comparison to our fixed-order, weak-limit truncation of the HDP-HMM, the beam 
sampler provides an asymptotically exact algorithm. However, the beam sampler 
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FIG 6. (a) Observation sequence (blue) and true state sequence (red) for a three-state HMM with 
state persistence, (b) Example of the sticky HDP-HMM direct assignment Gibbs sampler splitting 
temporally separated examples of the same true state (red) into multiple estimated states (blue) at 
Gibbs iteration 1,000. (c) Histogram of the inferred self-transition proportion parameter, p, for the 
sticky HDP-HMM blocked sampler. For the original HDP-HMM, the median (solid blue) and 10 th 
and 90 th quantiles ( dashed red) of Hamming distance between the true and estimated state sequences 
over the first 1,000 Gibbs samples from 200 chains are shown for the (d) direct assignment sampler, 
and (e) blocked sampler, (f) Hamming distance over 30,000 Gibbs samples from three chains of the 
original HDP-HMM blocked sampler, (g)-(i) and (k)-(l) Analogous plots to (d)-(f) for the sticky HDP- 
HMM and a non-sticky HDP-HMM using beam sampling, respectively, (j) A histogram of the effective 
beam sampler truncation level, L e ff, over the 30,000 Gibbs iterations from the three chains (blue) 
compared to the fixed truncation level, L = 20, used in the truncated sticky HDP-HMM blocked 
sampler results ( red). 
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can be slow to mix relative to our blocked sampler on the fixed, truncated model 
(see Fig. 6 for an example comparison.) The issue is that in order to consider a 
transition which has low prior probability, one needs a correspondingly rare slice 
variable sample at that time. Thus, even if the likelihood cues are strong, to be 
able to consider state sequences with several low-prior-probability transitions, one 
needs to wait for several rare events to occur when drawing slice variables. By con- 
sidering the full, exponentially large set of paths in the truncated state space, we 
avoid this problem. Of course, the trade-off between the computational cost of the 
blocked sampler on the fixed, truncated model (0(TL 2 )) and the slower mixing 
rate of the beam sampler yields an application-dependent sampler choice. 

The Hamming distance plots of Fig. 6(k) and (1), when compared to those of 
Fig. 6(e) and (f), depict the substantially slower mixing rate of the beam sampler 
compared to the blocked sampler (both using a non-sticky HDP-HMM). However, 
the theoretical computational benefit of the beam sampler can be seen in Fig. 6(j). 
In this plot, we present a histogram of the effective truncation level, L e t t , used over 
the 30,000 Gibbs iterations on three chains. We computed this effective truncation 
level by summing over the number of state transitions considered during a full 
sweep of sampling z\-t and then dividing this number by the length of the dataset, 
T, and taking the square root. Finally, on a more technical note, our fixed, truncated 
model allows for more vectorization of the code than the beam sampler. Thus, in 
practice, the difference in computation time between the samplers is significantly 
less than the 0{L 2 / L 2 e ^) factor obtained by counting state transitions. 

From this point onwards, we present results only from blocked sampling since 
we have seen the clear advantages of this method over the sequential, direct assign- 
ment sampler. 

Fast State-Switching. In order to warrant the general use of the sticky model, 
one would like to know that the sticky parameter incorporated in the model does 
not preclude learning models with fast dynamics. To this end, we explored the 
performance of the sticky HDP-HMM on data generated from a model with a high 
probability of switching between states. Specifically, we generated observations 
from a four-state HMM with the following transition probability matrix: 



We once again used a truncation level L = 20. Since we are restricting ourselves 
to the blocked Gibbs sampler, it is no longer necessary to use a conjugate base 
measure. Instead we placed an independent Gaussian prior on the mean parameter 
and an inverse-Wishart prior on the variance parameter. For the Gaussian prior, we 
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(d) (e) (f) 

FIG 7. (a) Observation sequence (blue) and true state sequence (red) for a four-state HMM with 
fast state switching. For the original HDP-HMM using a blocked Gibbs sampler: (b) the median 
(solid blue) and 10 th and 90 th quantiles (dashed red) of Hamming distance between the true and 
estimated state sequences over the first 1,000 Gibbs samples from 200 chains, and (c) Hamming 
distance over 30, 000 Gibbs samples from three chains, (d) Histogram of the inferred self-transition 
parameter, p, for the sticky HDP-HMM blocked sampler, (e)-(f) Analogous plots to (b)-(c) for the 
sticky HDP-HMM. 



set the mean and variance hyperparameters to be equal to the empirical mean and 
variance of the entire dataset. The inverse-Wishart hyperparameters were set such 
that the expected variance is equal to 0.75 times that of the entire dataset, with three 
degrees of freedom. 

The results depicted in Fig. 7 confirm that by inferring a small probability of 
self-transition, the sticky HDP-HMM is indeed able to capture fast HMM dynam- 
ics, and just as quickly as the original HDP-HMM (although with higher variabil- 
ity.) Specifically, we see that the histogram of the self-transition proportion param- 
eter p for this dataset (see Fig. 7(d)) is centered around a value close to the true 
probability of self-transition, which is substantially lower than the mean value of 
this parameter on the data with high persistence (Fig. 6(c).) 

6.2. Multinomial Emissions. The difference in modeling power, rather than 
simply burn-in rate, between the sticky and original HDP-HMM is more pro- 
nounced when we consider multinomial emissions. This is because the multino- 
mial observations are embedded in a discrete topological space in which there is no 
concept of similarity between non-identical observation values. In contrast, Gaus- 
sian emissions have a continuous range of values in R n with a clear notion of 
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FIG 8. (a) Observation sequence (blue) and true state sequence (red) for a five-state HMM with 
multinomial observations, (b) Histogram of the predictive probability of test sequences using the in- 
ferred parameters sampled every 100 th iteration from Gibbs iterations 10,000 to 30,000 for the sticky 
and original HDP-HMM. The Hamming distances over 30,000 Gibbs samples from three chains are 
shown for the (c) sticky HDP-HMM and (d) original HDP-HMM. 



closeness between observations under the Lebesgue measure, aiding in grouping 
observations under a single HMM state's Gaussian emission distribution, even in 
the absence of a self-transition bias. 

To demonstrate the increased posterior uncertainty with discrete observations, 
we generated data from a five-state HMM with multinomial emissions with a 0.98 
probability of self-transition and equal probability of transitions to the other four 
states. The vocabulary, or range of possible observation values, was set to 20. The 
observation and true state sequences are shown in Fig. 8(a). We placed a symmetric 
Dirichlet prior on the parameters of the multinomial distribution, with the Dirichlet 
hyperparameters equal to 2 (i.e., Dir(2, . . . , 2).) 

From Fig. 8, we see that even after burn-in, many fast-switching state sequences 
have significant posterior probability under the non-sticky model leading to sweeps 
through regions of larger Hamming distance error. A qualitative plot of one such in- 
ferred sequence after 30,000 Gibbs iterations is shown in Fig. 1(c). Such sequences 
have negligible posterior probability under the sticky HDP-HMM formulation. 

In some applications, such as the speaker diarization problem that is explored 
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in Section 8, one cares about the inferred segmentation of the data into a set of 
state labels. In this case, the advantage of incorporating the sticky parameter is 
clear. However, it is often the case that the metric of interest is the predictive power 
of the fitted model, not the accuracy of the inferred state sequence. To study per- 
formance under this metric, we simulated 10 test sequences using the same pa- 
rameters that generated the training sequence. We then computed the likelihood 
of each of the test sequences under the set of parameters inferred at every 100*^ 
Gibbs iteration from iterations 10,000 to 30,000. This likelihood was computed by 
running the forward-backward algorithm of Rabiner (1989). We plot these results 
as a histogram in Fig. 8(b). From this plot, we see that the fragmentation of data 
into redundant HMM states can also degrade the predictive performance of the in- 
ferred model. Thus, the sticky parameter plays an important role in the Bayesian 
nonparametric learning of HMMs even in terms of model averaging. 

6.3. Comparison to Independent Sparse Dirichlet Prior. We have alluded to 
the fact that the shared sparsity of the HDP-HMM induced by /3 is essential for in- 
ferring sparse representations of the data. Although this is clear from the perspec- 
tive of the prior model, or equivalently the generative process, it is not immediately 
obvious how much this hierarchical Bayesian constraint helps us in posterior infer- 
ence. Once we are in the realm of considering a fixed, truncated approximation to 
the HDP-HMM, one might propose an alternate model in which we simply place a 
sparse Dirichlet prior, Dir(a/L, . . . , a/L) with a/L < 1, independently on each 
row of the transition matrix. This is equivalent to setting (3 = [l/L, . . . , 1/L] in 
the truncated HDP-HMM, which can also be achieved by letting the hyperparame- 
ter 7 tend to infinity. Indeed, when the data do not exhibit shared sparsity or when 
the likelihood cues are sufficiently strong, the independent sparse Dirichlet prior 
model can perform as well as the truncated HDP-HMM. However, in scenarios 
such as the one depicted in Fig. 9, we see substantial differences in performance 
by considering the HDP-HMM, as well as the inclusion of the sticky parameter. 
We explored the relative performance of the HDP-HMM and sparse Dirichlet prior 
model, with and without the sticky parameter, on such a Markov model with multi- 
nomial emissions on a vocabulary of size 20. We placed a Dir(0.1, . . . , 0.1) prior 
on the parameters of the multinomial distribution. For the sparse Dirichlet prior 
model, we assumed a state space of size 50, which is the same as the truncation 
level we chose for the HDP-HMM (i.e., L = 50). The results are presented in 
Fig. 10. From these plots, we see that the hierarchical Bayesian approach of the 
HDP-HMM does, in fact, improve the fitting of a model with shared sparsity. The 
HDP-HMM consistently infers fewer HMM states and more representative model 
parameters. As a result, the HDP-HMM has higher predictive likelihood on test 
data, with an additional benefit gained from using the sticky parameter. 
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(a) (b) 

FIG 9. (a) State transition diagram for a nine-state HMM with one main state ( labeled 1 ) and eight 
sub-states (labeled 2 to 9.) All states have a significant probability of self-transition. From the main 
state, all other states are equally likely. From a sub-state, the most likely non-self-transition is a 
transition is back to the main state. However, all sub-states have a small probability of transitioning 
to another sub-state, as indicated by the dashed arcs, (b) Observation sequence (top) and true state 
sequence (bottom) generated by the nine- state HMM with multinomial observations. 



Note that the results of Fig. 10(f) also motivate the use of the sticky parameter 
in the more classical setting of a finite HMM with a standard Dirichlet sparsity 
prior. A motivating example of the use of sparse Dirichlet priors for finite HMMs 
is presented in Johnson (2007). 

7. Multimodal Emission Densities. In many application domains, the data 
associated with each hidden state may have a complex, multimodal distribution. We 
propose to model such emission distributions nonparametrically, using a DP mix- 
ture of Gaussians. This formulation is related to the nested DP (Rodriguez et al., 
2008), which uses a Dirichlet process to partition data into groups, and then mod- 
els each group via a Dirichlet process mixture. The bias towards self-transitions 
allows us to distinguish between the underlying HDP-HMM states. If the model 
were free to both rapidly switch between HDP-HMM states and associate multiple 
Gaussians per state, there would be considerable posterior uncertainty. Thus, it is 
only with the sticky HDP-HMM that we can effectively fit such models. 

We augment the HDP-HMM state z t with a term s t indexing the mixture com- 
ponent of the z\ h emission density. For each HDP-HMM state, there is a unique 
stick-breaking measure i/jk ~ GEM(cr) defining the mixture weights of the k th 
emission density so that st ~ ip Zt - Given the augmented state (zt, st), the obser- 
vation y t is generated by the Gaussian component with parameter 6 Zt , St ■ Note that 
both the HDP-HMM state index and mixture component index are allowed to take 
values in a countably infinite set. See Fig. 5(b). 

7.1. Direct Assignment Sampler. Many of the steps of the direct assignment 
sampler for the sticky HDP-HMM with DP emissions remain the same as for the 
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FIG 10. (a) The true transition probability matrix (TPM) associated with the state transition diagram 
of Fig. 9. (b)-(c) The inferred TPM at the 30,000th Gibbs iteration for the sticky HDP-HMM and 
sticky sparse Dirichlet model, respectively, only examining those states with more than 1% of the 
assignments. For the HDP-HMM and sparse Dirichlet model, with and without the sticky parameter, 
we plot: (d) the Hamming distance error over 10,000 Gibbs iterations, (e) the inferred number of 
states with more than 1 % of the assignments, and (f) the predictive probability of test sequences 
using the inferred parameters sampled every 100 th iteration from Gibbs iterations 5,000 to 10,000. 



regular sticky HDP-HMM. Specifically, the sampling of the global transition dis- 
tribution /3, the table counts rrijk and rhjk, and the override variables Wjt are un- 
changed. The difference arises in how we sample the augmented state (zt, St). 

The joint distribution on the augmented state, having marginalized the transition 
distributions irk and emission mixture weights tpk, is given by 

p(z t = k,s t =j | z\ t , s\ t ,y 1:T , f3, a, a, k, A) = p(s t = j \ z t = k, z\ t , s\ t ,y 1:T , a, A) 
(7.1) p(z t = k | z\ t ,s\t,yi:T,P,a, «,A). 

We then block-sample (zt, sj) by first sampling zt, followed by st conditioned on 
the sampled value of z t . The term p(st = j \ z% = k, z\ t , s\ t ,yi-T,cr, A) relies 
on how many observations are currently assigned to the j th mixture component 
of state k. These conditional distributions are derived in the Supplementary Ma- 
terial (Fox et al., 2010), with the resulting Gibbs sampler outlined in Algorithm 
2. 

7.2. Blocked Sampler. To implement blocked resampling of (z± : t, si : t)> we 
use weak limit approximations to both the HDP-HMM and DP emissions, approx- 
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imated to levels L and L 1 , respectively. The posterior distributions for /5 and -k^ 
remain unchanged from the sticky HDP-HMM; that of ip^ is given by 

(7.2) Vfc I Zi-.T,si:T,a ~ Dir(a/L' + n' kl , . . . ,a/L' + n' kL ,), 

where n' M is the number of s t taking a value £ when z t = k. (i.e., the number of 
observations assigned to the k th state's I th mixture component.) The procedure for 
sampling the augmented state (zi-.t, si : t) is derived in the Supplementary Material 
(see Algorithm 4, Fox et al. (2010)). 

7.3. Assessing the Multimodal Emissions Model. In this section, we evaluate 
the ability of the sticky HDP-HMM to infer multimodal emission distributions rel- 
ative to the model without the sticky parameter. We generated data from a five-state 
HMM with mixture-of-Gaussian emissions, where the number of mixture compo- 
nents for each emission distribution was chosen randomly from a uniform distri- 
bution on {1,2,..., 10}. Each component of the mixture was equally weighted 
and the probability of self-transition was set to 0.98, with equal probabilities of 
transitions to the other states. The large probability of self-transition is what dis- 
ambiguates this process from one with many more HMM states, each with a single 
Gaussian emission distribution. The resulting observation and true state sequences 
are shown in Fig. 11(a) and (b). 

We once again used a non-conjugate base measure and placed a Gaussian prior 
on the mean parameter and an independent inverse-Wishart prior on the variance 
parameter of each Gaussian mixture component. The hyperparameters for these 
distributions were set from the data in the same manner as in the fast-switching 
scenario. Consistent with the sticky HDP-HMM concentration parameters 7 and 
(a + k), we placed a weakly informative Gamma(l, 0.01) prior on the concentra- 
tion parameter a of the DP emissions. All results are for the blocked sampler with 
truncation levels L = V = 20. 

In Fig. 1 1 , we compare the performance of the sticky HDP-HMM with DP emis- 
sions to that of the original HDP-HMM with DP emissions (i.e., DP emissions, but 
no bias towards self-transitions.) As with the multinomial observations, when the 
distance between observations does not directly factor into the grouping of obser- 
vations into HMM states, there is a considerable amount of posterior uncertainty 
in the underlying HMM state. Even after 30,000 Gibbs samples, there are still state 
sequence sample paths with very rapid dynamics. The result of this fragmenta- 
tion into redundant states is a slight reduction in predictive performance on test 
sequences, as in the multinomial emission case. See Fig. 1 1(b). 

8. Speaker Diarization Results. Recall the speaker diarization task from Sec- 
tion 2, which involves segmenting audio recordings from the NIST Rich Transcrip- 
tion 2004-2007 database into speaker-homogeneous regions while simultaneously 



THE STICKY HDP-HMM 



27 




Time Log-likelihood 




FIG 11. (a) Observation sequence (blue) and true state sequence (red) for a five-state HMM with 
mixture-of-Gaussian observations, (b) Histogram of the predictive probability of test sequences using 
the inferred parameters sampled every 100" 1 iteration from Gibbs iterations 10,000 to 30,000 for the 
sticky and original HDP-HMM. The Hamming distance over 30,000 Gibbs samples from three chains 
are shown for the (c) sticky HDP-HMM and (d) original HDP-HMM, both with DP emissions. 



identifying the number of speakers. In this section we present our results on apply- 
ing the sticky HDP-HMM with DP emissions to the speaker diarization task. 

A minimum speaker duration of 500ms was set by associating two preprocessed 
MFCCs with each hidden state. We also tied the covariances of within-state mix- 
ture components (i.e., each speaker-specific mixture component was forced to have 
identical covariance structure), and used a non-conjugate prior on the mean and 
covariance parameters. We placed a normal prior on the mean parameter with 
mean equal to the empirical mean and covariance equal to 0.75 times the empir- 
ical covariance, and an inverse-Wishart prior on the covariance parameter with 
1000 degrees of freedom and expected covariance equal to the empirical covari- 
ance. Our choice of a large degrees of freedom is akin to an empirical Bayes ap- 
proach in that it concentrates the mass of the prior in reasonable regions based on 
the data. Such an approach is often helpful in high-dimensional applied problems 
since our sampler relies on forming new states (i.e., speakers) based on param- 
eters drawn from the prior. Issues of exploration in this high-dimensional space 
increases the importance of the setting of the base measure. For the concentra- 
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tion parameters, we placed a Gamma(12, 2) prior on 7, a Gamma(6, 1) prior on 
a + k, and a Gamma(l, 0.5) prior on a. The self-transition parameter p was given 
a Beta(500, 5) prior. For each of the 21 meetings, we ran 10 chains of the blocked 
Gibbs sampler for 10,000 iterations for both the original and sticky HDP-HMM 
with DP emissions. We used a sticky HDP-HMM truncation level of L = 15, 
where the DP-mixture-of-Gaussians emission distribution associated with each of 
these L HMM states was truncated to V = 30 components. Our choice of L signif- 
icantly exceeds the typical number of speakers, which in the NIST database tends 
to be between 4 and 6. In practice, our sampler never approached using the full set 
of possible states and emission components. 

In order to explore the importance of capturing the temporal dynamics, we also 
compare our sticky HDP-HMM performance to that of a Dirichlet process mixture 
of Gaussians that simply pools together the data from each meeting ignoring the 
time indices associated with the observations. We considered a truncated Dirichlet 
process mixture model with L = 50 components and a Gamma(6, 1) prior on the 
concentration parameter 7. The base measure was set as in the sticky HDP-HMM. 

For the NIST speaker diarization evaluations, the goal is to produce a single 
segmentation for each meeting. Due to the label-switching issue (i.e., under our 
exchangeable prior, labels are arbitrary entities that do not necessarily remain con- 
sistent over Gibbs iterations), we cannot simply integrate over multiple Gibbs- 
sampled state sequences. We propose two solutions to this problem. The first, 
which we refer to as the likelihood metric, is to simply choose from a fixed set 
of Gibbs samples the one that produces the largest likelihood given the estimated 
parameters (marginalizing over state sequences), and then produce the correspond- 
ing Viterbi state sequence. This heuristic, however, is sensitive to overfitting and 
will, in general, be biased towards solutions with more states. 

An alternative, and more robust, metric is what we refer to as the minimum 
expected Hamming distance. We first choose a large reference set TZ of state se- 
quences produced by the Gibbs sampler and a possibly smaller set of test sequences 
T. Then, for each sequence «W in the test set T, we compute the empirical mean 
Hamming distance between the test sequence and the sequences in the reference 
set TZ; we denote this empirical mean by Hi. We then choose the test sequence 
z^> that minimizes this expected Hamming distance. That is, 



The empirical mean Hamming distance Hi is a label-invariant loss function since it 
does not rely on labels remaining consistent across samples — we simply compute 
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FIG 12. (a)-(c) For each of the 21 meetings, comparison of diarizations using sticky vs. original 
HDP-HMM with DP emissions. In (a) we plot the DERs corresponding to the Viterbi state sequence 
using the parameters inferred at Gibbs iteration 10,000 that maximize the likelihood, and in (b) the 
DERs using the state sequences that minimize the expected Hamming distance. Plot (c) is the same 
as (b), except for running the 10 chains for meeting 16 out to 50,000 iterations, (d)-(f) Comparison 
of the sticky HDP-HMM with DP emissions to the ICS! errors under the same conditions. 



where Hamm is the Hamming distance between sequences z w and z^' 

after finding the optimal permutation of the labels in test sequence to those in 
reference sequence z«l At a high level, this method for choosing state sequence 
samples aims to produce segmentations of the data that are typical samples from 
the posterior. Jasra et al. (2005) provides an overview of some related techniques 
to address the label-switching issue. Although we could have chosen any label- 
invariant loss function to minimize, we chose the Hamming distance metric be- 
cause it is closely related to the official NIST diarization error rate (DER) that is 
calculated during the evaluations. The final metric by which the speaker diariza- 
tion algorithms are judged is the overall DER, a weighted average over the set of 
meetings based on the length of each meeting. 

In Fig. 12(a), we report the DER of the chain with the largest likelihood given 
the parameters estimated at the 10, OOO*' 1 Gibbs iteration for each of the 21 meet- 
ings, comparing the sticky and original HDP-HMM with DP emissions. We see that 
the sticky model's temporal smoothing provides substantial performance gains. Al- 
though not depicted in this paper, the likelihoods based on the parameter estimates 
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under the original HDP-HMM are almost always higher than those under the sticky 
model. This phenomenon is due to the fact that without the sticky parameter, the 
HDP-HMM over-segments the data and thus produces parameter estimates more 
finely tuned to the data resulting in higher likelihoods. Since the original HDP- 
HMM is contained within the class of sticky models (i.e., when k = 0), there 
is some probability that state sequences similar to those under the original model 
will eventually arise using the sticky model. Thus, since the parameters associ- 
ated with these fast-switching sequences result in higher likelihood of the data, the 
likelihood metric is not very robust — one would expect the performance under the 
sticky model to degrade given enough Gibbs chains and/or iterations. In Fig. 12(b), 
we instead report the DER of the chain whose state sequence estimate at Gibbs 
iteration 10,000 (this defines the test set T) minimizes the expected Hamming dis- 
tance to the sequences estimated every 100 Gibbs iteration, discarding the first 
5,000 iterations (this defines the reference set 1Z). Due to the slow mixing rate of 
the chains in this application, we additionally discard samples whose normalized 
log-likelihood is below 0.1 units of the maximum at Gibbs iteration 10,000. From 
this figure, we see that the sticky model still significantly outperforms the original 
HDP-HMM, implying that most state sequences produced by the original model 
are worse, not just the one corresponding to the most-likely sample. Example max- 
imum likelihood and minimum expected Hamming distance diarizations are dis- 
played in Fig. 13. One noticeable exception to this trend is the NIST_20051102- 
1323 meeting (meeting 16). For the sticky model, the state sequence using the 
maximum likelihood metric had very low DER (see Fig. 13(b)); however, there 
were many chains that merged speakers and produced segmentations similar to the 
one in Fig. 13(c), resulting in such a sequence minimizing the expected Hamming 
distance. See Section 9 for a discussion on the issue of merged speakers. Running 
meeting 16 for 50,000 Gibbs iterations improved the performance, as depicted by 
the revised results in Fig. 12(c). We summarize our overall performance in Table 1, 
and note that (when using the 50,000 Gibbs iterations for meeting 16 and 10,000 
Gibbs iterations for all other meetings 2 ) we obtain an overall DER of 17.84% using 
the sticky HDP-HMM versus the 23.91% of the original HDP-HMM model. Al- 
ternatively, when constrained to single Gaussian emissions the sticky HDP-HMM 
and original HDP-HMM have overall DERs of 34.97% and 36.89%, respectively, 
which clearly demonstrates the importance of considering DP emissions. When 
considering the DP mixture-of-Gaussians model (ignoring the time indices asso- 

2 On such a large dataset, running 10 chains for 50,000 iterations for each of the 21 meetings 
would have represented a significant computational burden and thus we only ran the chains to 50,000 
iterations for meeting 16, which clearly had not mixed after 10,000 iterations (based on an exami- 
nation of trace plots of log-likelihoods; see Fig. 15). In meeting 16 the differences between two of 
the speakers are especially subtle, and our sampler has difficulty in reliably finding parameters that 
separate these speakers. 
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FIG 13. Qualitative results for meetings AMIJ.0041210-1052 (meeting 1, top), CMUJ20050228- 
1615 (meeting 3, middle), and NIST_20051 102-1323 meeting (meeting 16, bottom), (a) True state se- 
quence with the post-processed regions of overlapping- and non- speech time steps removed, (b)-(c) 
Plotted only over the time-steps as in (a), the state sequences inferred by the sticky HDP-HMM with 
DP emissions at Gibbs iteration 10,000 chosen using the most likely and minimum expected Ham- 
ming distance metrics, respectively. Incorrect labels are shown in red. For meeting 1, the maximum 
likelihood and minimum expected Hamming distance diarizations are similar whereas in meeting 3 
we clearly see the sensitivity of the maximum likelihood metric to overfitting. The minimum expected 
Hamming distance diarization for meeting 16 has more errors than that of the maximum likelihood 
due to poor mixing rates and many samples failing to identify a speaker. 



ciated with the observations), the overall DER is 72.67%. If one uses the ground 
truth labels to map multiple inferred DP mixture components to a single speaker 
label, the overall DER drops to 54.19%. The poor performance of the DP mixture- 
of-Gaussians model, even when assuming that ground truth labels are available, 
which would not be the case in practice, illustrates the importance of the temporal 
dynamics captured by the HMM. 

As a further comparison, the algorithm that was by far the best performer at the 
2007 NIST competition — the algorithm developed by a team at the International 
Computer Science Institute (ICSI) (Wooters and Huijbregts, 2007) — has an over- 
all DER of 18.37%. The ICSI team's algorithm uses agglomerative clustering, and 
requires significant tuning of parameters on representative training data. In con- 
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Overall DERs (%) 


Min Hamming 


Max Likelihood 


2-Best 


5-Best 


Sticky HDP-HMM 


19.01 (17.84) 


19.37 


16.97 


14.61 


Non-Sticky HDP-HMM 


23.91 


25.91 


23.67 


21.06 



Table 1 

Overall DERs for the sticky and original HDP-HMM with DP emissions using the minimum 
expected Hamming distance and maximum likelihood metrics for choosing state sequences at Gibbs 
iteration 10,000. For the maximum likelihood criterion, we show the best overall DER if we 
consider the top two or top five most-likely candidates. The number in the parentheses is the 
performance when running meeting 16 for 50,000 Gibbs iterations. The overall ICSI DER is 
18.37%, while the best achievable DER with the chosen acoustic pre-processing is 10.57%. 



trast, our hyperparameters are automatically set meeting-by-meeting, as outlined 
at the beginning of this section. An additional benefit of the sticky HDP-HMM 
over the ICSI approach is the fact that there is inherent posterior uncertainty in this 
task, and by taking a Bayesian approach we are able to provide several interpreta- 
tions. Indeed, when considering the best per-meeting DER for the five most likely 
samples, our overall DER drops to 14.61% (see Table 1). Although not helpful 
in the NIST evaluations, which requires a single segmentation, providing multiple 
segmentations could be useful in practice. 

To ensure a fair comparison, we use the same speech/non-speech pre-processing 
and acoustic features as ICSI, so that the differences in our performance are due 
to changes in the identified speakers. As depicted in Fig. 14, both our performance 
and that of ICSI depend significantly on the quality of this pre-processing step. 
For the periods of non-speech that are incorrectly identified as speech during pre- 
processing, we are forced to produce errors on these sections since they will be 
assigned an HMM label (and thus a speaker label) that is separate from the label 
assigned to the pre-processed sections labeled as non-speech. Another source of er- 
rors are periods of overlapping speech, which impede our ability to clearly identify 
a single speaker. In Fig. 14(a), we compare the meeting-by-meeting DERs of the 
sticky HDP-HMM, the original HDP-HMM, and the ICSI algorithm. If we use the 
ground truth speaker labels for the post-processed data (assigning undetected non- 
speech a label different than the pre-processed non-speech), the resulting overall 
DER is 10.57% with meeting-by-meeting DERs displayed in Fig. 14(b). This num- 
ber provides a lower bound on the achievable performance using the speech/non- 
speech preprocessing, our block-averaging of features, and our assumptions of 
minimum duration. Beyond these forced errors, it is clear from Fig. 14(a) that the 
sticky HDP-HMM with DP emissions provides performance comparable to that of 
the ICSI algorithm while the original HDP-HMM with DP emissions performs sig- 
nificantly worse. Overall, the results presented in this section demonstrate that the 
sticky HDP-HMM with DP emissions provides an elegant and empirically effective 
speaker diarization method. 
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FIG 14. (a) Chart comparing the DERs of the sticky and original HDP-HMM with DP emissions to 
those of ICSI for each of the 21 meetings. Here, we chose the state sequence at the 10, 000 i ' 1 Gibbs 
iteration that minimizes the expected Hamming distance. For meeting 16 using the sticky HDP-HMM 
with DP emissions, we chose between state sequences at Gibbs iteration 50,000. (b) DERs associated 
with using ground truth speaker labels for the post-processed data. Here, we assign undetected non- 
speech a label different than the pre-processed non-speech. 



9. Discussion. We have developed a Bayesian nonparametric approach to the 
problem of speaker diarization, building on the HDP-HMM presented in Teh et al. 
(2006). Although the original HDP-HMM does not yield competitive speaker di- 
arization performance due to its inadequate modeling of the temporal persistence 
of states, the sticky HDP-HMM that we have presented here resolves this problem 
and yields a state-of-the-art solution to the speaker diarization problem. 

We have also shown that this sticky HDP-HMM allows a fully Bayesian non- 
parametric treatment of multimodal emissions, disambiguated by its bias towards 
self-transitions. Accommodating multimodal emissions is essential for the speaker 
diarization problem and is likely to be an important ingredient in other applications 
of the HDP-HMM to problems in speech technology. 

We also presented efficient sampling techniques with mixing rates that improve 
on the state-of-the-art by harnessing the Markovian structure of the HDP-HMM. 
Specifically, we proposed employing a truncated approximation to the HDP and 
block-sampling the state sequence using a variant of the forward-backward algo- 
rithm. Although the blocked samplers yield substantially improved mixing rates 
over the sequential, direct assignment samplers, there are still some pitfalls to these 
sampling methods. One issue is that for each new considered state, the parameter 
sampled from the prior distribution must better explain the data than the parameters 
associated with other states that have already been informed by the data. In high- 
dimensional applications, and in cases where state-specific emission distributions 
are not clearly distinguishable, this method for adding new states poses a signif- 
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FIG 15. Trace plots of (a) log-likelihood, (b) Hamming distance error, and (c) number of speakers 
for 10 chains for two meetings: CMU '3,00509 12-0900 / meeting 5 (top) and NISTJ20051102-1323 / 
meeting 16 (bottom). For meeting 5, which has behavior representative of the majority of the meet- 
ings, we show traces over the 10,000 Gibbs iterations used for the results in Section 8. For meeting 
16, we ran the chains out to 100,000 Gibbs iterations to demonstrate the especially slow mixing rate 
for this meeting. The dashed blue vertical lines indicate 10,000 iterations. 



icant challenge. Indeed, both issues arise in the speaker diarization task and we 
did have difficulties with mixing. Further evidence of this is presented in the trace 
plots in Fig. 15, where we plot log-likelihoods, Hamming distances, and speaker 
counts for 10,000 Gibbs sampling iterations of meeting 5 and 100,000 iterations of 
meeting 16. As discussed previously, meeting 16 is the most problematic meeting 
in our data set, and these plots provide clear evidence that our sampler is not mix- 
ing on this meeting. But even on meeting 5, which is more representative of the 
full set of meetings and which is segmented effectively by our procedure, we see a 
relatively slow evolution of the sampler, particularly as measured by the number of 
speakers. Our use of the minimum expected Hamming distance procedure to select 
samples mitigates this difficulty, but further work on sampling procedures for the 
sticky HDP-HMM is needed. One possibility is to consider split-merge algorithms 
similar to those developed in Jain and Neal (2004) for the DP mixture model. 

A limitation of the HMM in general is that the observations are assumed condi- 
tionally i.i.d. given the state sequence. This assumption is often insufficient in cap- 
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turing the complex temporal dependencies exhibited in real-world data. Another 
area of future work is to consider Bayesian nonparametric versions of models bet- 
ter suited to such applications, like the switching linear dynamical system (SLDS) 
and switching VAR process. A first attempt at developing such models is presented 
in Fox et al. (2009). An inspiration for the sticky HDP-HMM actually came from 
considering the original HDP-HMM as a prior for an SLDS. In such scenarios 
where one does not have direct observations of the underlying state sequence, the 
issues arising from not properly capturing state persistence are exacerbated. The 
sticky HDP-HMM presented in this paper provides a robust building block for de- 
veloping more complex Bayesian nonparametric dynamical models. 

Acknowledgements. We thank O. Vinyals, G. Friedland, and N. Morgan for 
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SUPPLEMENTARY MATERIAL 

Notational Conventions and Derivations of Gibbs Samplers 

(doi: ???http://lib.stat.cmu.edu/aoas/???/???; .pdf). We present detailed derivations 
of the conditional distributions used for both the direct assignment and blocked 
Gibbs samplers, as well as associated pseudo-code. We also provide a list of nota- 
tional conventions used throughout the paper. 
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APPENDIX A: NOTATIONAL CONVENTIONS 
General Notation 



Z + the set of positive integers 

R the set of reals 

xi-t the sequence {x\, . . . , x t } 

x\ t the sequence {x±, . . . , Xt-i, :ct+i, • • • , xt}, where T is largest 
possible index 

%-b x o,b 

| • | cardinality of a set 

5(k, j) the discrete Kronecker delta 

5q measure concentrated at 9 

E[-\ expectation of a random variable 

DP(a, H) Dirichlet process distribution with concentration parameter a and 
base measure H 

Dir(a) if -dimensional finite Dirichlet distribution with parameter a 

Ber(p) Bernoulli distribution with parameter p 

GEM (7) stick-breaking distribution with parameter 7 
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Hierarchical Dirichlet Process and CRF with Loyal Customers 



Vji 


i observation within j group 


Zji 


index oi mixture component that generated observation yji 


hi 


(non-unique) parameter associated with observation yji 


0* 


(non-unique) parameter, or dish, served at table t m restaurant j 




7 / n ' ill . c j i • . ii 

k unique global parameter oi the mixture model 


i 

T .ji 


table assignment for observation, or customer, 


7 

k jt 


considered dish assignment ror table t m restaurant j 


k jt 


11*1 ' i /* A 1 t 1 * 1 

served dish assignment ror table £ in restaurant j 


h 


the set of all considered dish assignments in restaurant j 


h 


the set of all served dish assignments in restaurant j 


w jt 


override variable for table t in restaurant j 


fljt 


number of customers at table t in restaurant j 


rrijk 


number of tables in restaurant j that considered dish k 


m jk 


number of tables in restaurant j that were served dish k 


Tj 


number of currently occupied tables in restaurant j 


K 


number of unique dishes considered in the franchise 


K 


number of unique dishes served in the franchise 
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Sticky HDP-HMM 

yt observation from the hidden Markov model at time t 

zt state of the Markov chain at time t 

rijk number of transitions from state j to state k in z\ : t 

n~ k number of transitions from state j to state k in Z\-.t, not counting 

the transitions zt~\ — > Zt or zt — > z t +\ 
k self-transition parameter 

p self-transition proportion parameter k/(ol + k) 

with DP emissions 

St index of mixture component that generated observation y t 

n' k - number of observations assigned to the k th state's j th mixture 
component 

n'j"* number of observations assigned to the k th state's j th mixture 

component, not counting observation y t 
K' k number of currently instantiated mixture components for the k th 

state's emission distribution 



THE STICKY HDP-HMM 



41 



APPENDIX B: VARIANTS ON THE CHINESE RESTAURANT PROCESS 

In this appendix, we elaborate upon the Chinese restaurant analogies associated 
with the HDP and the sticky HDP-HMM. These analogies greatly aid in developing 
the Gibbs samplers derived in this Supplementary Material. 

B.l. Chinese Restaurant Franchises. Recall the hierarchical Dirichlet pro- 
cess (HDP) of Sec. 4. Teh et al. (2006) have described the marginal probabilities 
obtained from integrating over the random measures Go and {Gj} in terms of a 
Chinese restaurant franchise (CRF) that is an analog of the Chinese restaurant 
process of Sec. 3. The CRF is comprised of J restaurants, each corresponding to 
an HDP group, and an infinite buffet line of dishes common to all restaurants. The 
process of seating customers at tables, however, is restaurant specific. We introduce 
indicator variables tji and kjt to represent table and dish assignments. There are J 
restaurants (groups), each with infinitely many tables (clusters) at which customers 
(observations) sit. Each customer is pre-assigned to a given restaurant determined 
by that customer's group j. The table assignment for the i th customer in the j 
restaurant is chosen as tji ~ ttj, and each table is assigned a dish (parameter) via 
kjt ~ /3. One can think of /3 as a set of ratings for the dishes served in the buffet 
line. Observation yji is then generated by global parameter 6'^ = 6* t = kjt • 
The generative model is summarized below and is depicted as a graphical model in 
Fig. 16(a): 

(B.l) |/3~/3 tjilitj^Tj yji | {9 k }kLi,{kjt}™i,tji ~ F{0 k ). 

Marginalizing over the stick-breaking measures ttj and /3 yields the following 
predictive distributions that describe the CRF: 

(B.2) p(tji | tji, tji-!, a) oc h jt 5(tji,t) + aS(tji,Tj + 1) 

t=i 

K 

p(k jt | k x ,k 2 , . . . ,kj_ v kji, . . . , feji_i,7) oc s ^m. k 5{kj tl k) + ^5{k jt ,K + 1), 

k=i 

where m. k = J2j m jk and kj = {kji, . . . , kj^}- Here, hj t denotes the number of 
customers in restaurant j sitting at table t, mjk the number of tables in restaurant 
j serving dish k, Tj the number of currently occupied tables in restaurant j, and K 
the total number of unique dishes being served in the franchise. Eq. (B.2) implies 
that upon entering the j th restaurant in the CRF, customer yji sits at currently 
occupied tables tji with probability proportional to the number of currently seated 
customers, or starts a new table Tj + 1 with probability proportional to a. The first 
customer to sit at a table goes to the buffet line and picks a dish kjt for their table, 
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(a) (b) 

FIG 16. Graphical model of (a) CRF, and (b) CRF with loyal customers. Customers yji sit at table 
tji\-kj ~ -Kj. In the CRF, each table chooses a dish kjt \/3 ~ j3 while in the CRF with loyal customers 
tables consider a dish kjt\/3 ~ f3, but override variables Wjt\a, K ~ Ber(K,/(a + ft)) can force the 
served dish kjt to be j. See Sec. B.2. 



choosing the dish with probability proportional to the number of times that dish has 
been picked previously, or ordering a new dish 9k+i with probability proportional 
to 7. The intuition behind this predictive distribution is that integrating over the dish 
ratings /3 results in customers making decisions based on the observed popularity 
of the dishes. 

B.2. Chinese Restaurant Franchises with Loyal Customers. We extend the 
Chinese restaurant metaphor to the sticky HDP-HMM, where our franchise now 
has restaurants with loyal customers. In addition to providing intuition for the 
predictive distribution on assignment variables, developing this metaphor aids in 
constructing the Gibbs samplers of Sec. 5.1 and Sec. 5.2. In the CRF with loyal 
customers, each restaurant in the franchise has a specialty dish with the same index 
as that of the restaurant. Although this dish is served elsewhere, it is more popular 
in the dish's namesake restaurant. We see this increased popularity in the specialty 
dish from the fact that a table's dish is now drawn from the modified dish ratings: 

(B.3) k jt \a,K,P ■ J -. 

a + k 

Specifically, we note that each restaurant has a set of restaurant-specific ratings of 
the buffet line that redistributes the shared ratings /3 so that there is more weight on 
the house-specialty dish. 

Recall that while customers in the CRF of the HDP are pre-partitioned into 
restaurants based on the fixed group assignments, in the HDP-HMM the value of 
the state zt determines the group assignment (and thus restaurant) of customer 
yt+i- Therefore, we describe a generative process that first assigns customers to 
restaurants and then assigns customers to tables and dishes. We refer to z t as the 
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parent and zt+i as the child. The parent enters a restaurant j determined by its 
parent (the grandparent), z t -i = j- We assume there is a bijective mapping / : 
t — > ji of time indices t to restaurant/customer indices ji. The parent then chooses 
a table tji ~ 7Cj and that table is served a dish indexed by kj t . Noting that z t = 
Zji = kj tji (i.e., the value of the state is the dish index), the increased popularity 
of the house specialty dish implies that children are more likely to eat in the same 
restaurant as their parent and, in turn, more likely to eat the restaurant's specialty 
dish. This develops family loyalty to a given restaurant in the franchise. However, 
if the parent chooses a dish other than the house specialty, the child will then go to 
the restaurant where this dish is the specialty and will in turn be more likely to eat 
this dish, too. One might say that for the sticky HDP-HMM, children have similar 
tastebuds to their parents and will always go to the restaurant that prepares their 
parent's dish best. Often, this keeps many generations eating in the same restaurant. 

The inference algorithm for the sticky HDP-HMM, which is derived in Sec. 5.1, 
is simplified if we introduce a set of auxiliary random variables kjt and Wjt as 
follows: 



where Ber(p) represents the Bernoulli distribution with parameter p. Here, we have 
defined a self-transition parameter p = k/(ol + k). The table first chooses a dish kjt 
without taking the restaurant's specialty into consideration (i.e., the original CRF). 
With some probability, this considered dish is overridden (perhaps by a waiter's 
suggestion) and the table is served the specialty dish j. Thus, kjt represents the 
served dish. We refer to Wjt as the override variable. For the original HDP-HMM, 
when k = 0, the considered dish is always the served dish since Wjt = for all 
tables. 

A graphical model representation of the sticky HDP-HMM is shown in Fig. 16(b). 
Although not explicitly represented in this graph, the sticky HDP-HMM still in- 
duces a Markov structure on the indicator random variables zt, which, based on the 
value of the parent state Zt-\, are mapped to a group-specific index ji. 

One can derive a distribution on partitions by marginalizing over the stick- 
breaking distributed measures tTj and /3, just as in the HDP. The CRF with loyal 



(B.4) 



kjt | P 
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customers is then described by: 

(B.5) p(tji | tji, . . . ,tji_i,a,K) oc y~)njt8(tjj,t) + (a + K,)6(tji,Tj + 1) 

t=i 
R 

p(kj t ,k j _ 1 ,k jl , . . . ,fcj t _i,7) oc ^2m. k 5(k jt ,k) + jS(k jt ,K + 1), 

fc=i 

where is the number of tables in restaurant j that considered dish fc, and A' 
the number of unique considered dishes in the franchise. The distributions on Wjt 
and kj t remain as before, so that considered dishes are sometimes overridden by 
the house specialty. 
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APPENDIX C: DIRECT ASSIGNMENT SAMPLER 

This supplementary material provides the derivations for the sequential, direct 
assignment Gibbs samplers for the sticky HDP-HMM and sticky HDP-HMM with 
DP emissions. Throughout this section, we will refer to the random variables in 
the graph of Fig. 16(b). For these derivations we include the n term of the sticky 
HDP-HMM; the derivations for the original HDP-HMM follow directly by setting 
k = 0. The resulting Gibbs samplers are outlined in Algorithms 1 and 2. 

C.l. Sticky HDP-HMM. To derive the direct assignment sampler for the sticky 
HDP-HMM, we first assume that we sample: table assignments for each customer, 
tji, served dish assignments for each table, kjt', considered dish assignments, kjt, 
dish override variables, Wjt, and the global mixture weights, ft. Because of the 
properties of the HDP, and more specifically the stick-breaking distribution, we are 
able to marginalize the group-specific distributions ttj and parameters 9k and still 
have closed-form distributions from which to sample (since exchangeability im- 
plies that we may treat every table and dish as if it were the last, as in Eq. (B.5).) 
The marginalization of these variables is referred to as Rao-Blackwellization (Casella and Robert, 
1996). The assumption of having tji and kjt is a stronger assumption than that of 
having zji since Zji can be uniquely determined from tji and kjt, though not vice 
versa. We proceed to show that directly sampling Zji instead of tji and kjt is suffi- 
cient when the auxiliary variables rrijk, rhjk, an d Wjt are additionally sampled. 

C.l.l. Sampling z t . The posterior distribution of z t factors as: 

p(zt = k | z\ t , yi-.T, ft, a, k) oc / ~[p(TTi | a, /3, k) p(z T I TT ZT _ 1 )dn 

[l[He k \\)l[f(y T \e Zr )do 

J k t 

(C.l) ocp(z t = k | z\ t ,ft,a,K)p(y t \ y\ t ,z t = k,z\ t ,X). 

Here, /(■ | 9) is the conditional density associated with the likelihood distribution 
F(9) and h(- | A) with the base measure H(X). 

The term p{z t = k \ z\t, ft, a, k), which arises from integration over 7r, is a vari- 
ant of the Chinese restaurant franchise prior, while p(y t \ j/w, z t= k, 2w, A) is the 
likelihood of an assignment z t = k having marginalized the parameter 9^. 

The conditional distribution p(z t = k \ z\ t , ft, a, k) of Eq. (C.l) can be written 
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as: 



p(z t = k\ z\ t ,P,a,n) <x / p(z t+1 \ ir k )p(z t = k \ ^ Zt -i) 



7T 



JJ(p(7Ti | a,P,K) p(z T | TTi))dTV 

i r\z T -i=i,T^t,t+l 

(C2) oc / p(z t+ i | ir k )p(zt = k I Ttzt^) 

JIT 

~[p(iTi | {z T | z T _i = i,r j^t,t+ l},P,a,K)dn. 

i 

Let zt-i = j- If k ^ j, that is, assuming a change in state value at time t, then 

p{z t = k | z\ t ,/3,a,n) 

OC / p(z t +l | TTk)p(^k I {^r I 2-r- 1 =fc,T^t,t+l}, /3, a, K)d,TT k 
I p( Zf = k \ TTj)p(TTj | {z T | Z T _l =j,T^t,t + 1}, /3, a, «)d7Tj 

(C.3) oc p(z t+ i I {z T I z T -\ = k,r ^ t,t + l},/3, a, k) 

p(z t = k j {z T | z T _i =j,r^M+ 1}, /3, a, k). 

When considering the probability of a self-transition (i.e., k = j), we have 

P(Zi = j I Z\t,P, a,K) (X / p{z t+ l | TTjXzt = j | 7Tj) 



J>(7Tj | {z T | Z T _i =fc,T^t,i+l}, /?, a, K)dlTj 

(C.4) oc = j,z t+ i I {z r [ z T _i = k,r ^ t,t + 1},/3,q,k). 

These predictive distributions can be derived by standard results arising from 
having placed a Dirichlet prior on the parameters defining these multinomial ob- 
servations z T . The finite Dirichlet prior is induced by considering the finite partition 
{1, . . . , K, Ar} of Z+, where A^ = {K+ 1, K+2, . . . } is the set of unrepresented 
state values in z\ t . The properties of the DP dictate that on this finite partition, we 
have the following form for the group-specific transition distributions: 

(C.5) TTj I a,f3 ~ Dir(a/3i, . . . ,a(3j + k, . . . , a(3 K , a/%), 

where fit = YI^Lk+i Pi- Using this prior, we derive the distribution of a generic 
set of observations generated from a single transition distribution 7Tj given the hy- 
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perparameters a, (3, and k: 




p({z T | z T -i = i} | ft, a, k) = / p{ni | /3,a,K)p({z T | z r _i = i} \ n^dni 



r (Efc afffc + ^(fc, i)) J]fc r ("/^fc + i) + ^jfc) 
U k T(a/3 fc + «£(*:, 0) a/3 fc + k<5(£;, i) + n jk ) 

T(a + k) -i-r T(a(3 k + n5(k,i) + n jk ) 
T{a + K + m.) A A r(a/3 fc + /c£(fc,i)) ' 



where we make a slight abuse of notation in taking Pk+i = /%• We use Eq. (C.6) 
to determine that the first component of Eq. (C.3) is 

p(zt = k | {z T j z T _i = j,r ^t,t + a, n) 

_ p({z T | gr-i = J, r ^ t + 1, z t = k} | ff, a, k) 
p{{z T I ^ r _i = j, t ^ i, t + 1} | ft, a, k) 
r(a + k + n"*) r(a/3 fc + « + rw* + 1) 



Here, nj k denotes the number of transitions from state j to k not counting the 
transition from Zt-\ to z t or from z t to zt+i- Similarly, the second component of 
Eq. (C.3) is derived to be 



(C.7) 



r(a + nf + 1) T{ap k + ng) 
aP k + n~ fc * 



(C.8) 



p(z t +i = £ | {z T | z T _i = k,r ^ t,t + 1}, /3, a, «) 



-i 
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For k = j, the distribution of Eq. (C.4) reduces to 



p(z t = j, zt+i | {z T \ z T _i = j, t ^ t, t + 1}, P, a, k) 

= p({z T I gr-i = j} I /3,a,fiQ 

p({z T | z r _i = j, r 7^ t, t + 1} |/3, a, re) 

f r(a+ K +n-') r(aft+K+nT i t +l) r(aft+nj/+l) 
r(a+K+n- t +2) r( Q /3j+K+n-/) r (aft +«."') ' 



r(a+K+n-*) r(«^+ft+n- j t +2) 
k r(a+K+n-'+2) r(aft+K+nT/) ' 



(a+K+nT t +l)(a+ K +n- t ) ' 
(a/3 j +K+n-*+l)(a/3 j +K+n7 t ) 

(a+K+n-'+lJCa+K+n-') : 



z f+ i = £,£ ^ j; 



zt+i = j; 



(C.9) 



{aPj + k + nJ){aPi + nj + (« + l)<5(j, £)) 



(a + k + n j t )(a + k + n^. 1 + 1) 
Combining these cases, the prior predictive distribution of zt is: 

(CIO) p(z t = k | z\ t , P, a, k) 

(ap k + n~l_ xk + K8(zt-i,k)) 



oc 



+«;5(fe,%+i)4-5(2; t _i,fc)(5(fe,«t+i) 



a+n fc . +K+<5(2 t _i,fc) 



fee 

fc = ir + l. 



The conditional distribution of the observation y t given an assignment z t = k 
and given all other observations y T , having marginalized out 6 k , can be written as 
follows: 

P(yt\y\t,zt = k,z\ t ,X) oc J f(yt\0 k )h(9 k \ A) JJ f(y T \0 k )d6 k 

r\z T =k,T^t 

x J fivt I 0fcM0fc I {?/t | z r = k,T / t}, X)dO k 
(C.ll) oc p(y t | {?/r I z T = k,T ^ t}, A). 

There exists a closed-form distribution for this likelihood if we consider a conju- 
gate distribution on the parameter space G. 

Assuming our emission distributions are Gaussian with unknown mean and co- 
variance parameters, the conjugate prior is normal-inverse-Wishart distribution, 
which we denote by AfTW(C, v, A). Here, A = {£, i/, A}. Via conjugacy, 
the posterior distribution of 9 k = {fx k ,T, k } given a set of Gaussian observations 
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yt ~ W(/Ufc, Efe) is distributed as an updated normal-inverse-Wishart MZW(Ck,'&k, u k 
where 

Ck = C + \{Vs \z s = k,s^t}\^(+ \Y k \ 
Vk = v + \Y k \ 

Ckh = C# + Yl Vs 

ys& k 

u k A k = uA + Y, y s y T s + (W T - Ckh$I- 

y s & k 

Marginalizing 6 k induces a multivariate Student-t predictive distribution for y t 
(Gelman et al., 2004): 

p(Ut I {Vt I z T = k,r ^ t}, C, i9, v, A) = t Vh _ d _ x [y t ; & k , - ^ + \^— A k 



(k{vk - d-l) 



(C12) =t i>k (y t ;fi k ,T, k ). 



C.1.2. Sampling p. Let K be the number of unique dishes considered. We 
note that for the sticky HDP-HMM, every served dish had to be considered in 
some restaurant. The only scenario in which this would not be the case is if for 
some dish j, every table served dish j arose from an override decision. However, 
overrides resulting in dish j being served can only occur in restaurant j, and this 
restaurant would not exist if dish j was not considered (and thus served) in some 
other restaurant. Therefore, each served dish had to be considered by at least one 
table in the franchise. On the other hand, there may be some dishes considered that 
were never served. From this, we conclude that K > K. We will assume that the 
K served dishes are indexed in {1, . . . ,K} and any considered, but not served, 
dish is indexed in {K + 1, K + 2, . . . }. For the sake of inference, we will see in 
the following section that K never exceeds K, the number of unique considered 
dishes, implying that K = K. 

Take a finite partition {6\, 02, ■ ■ ■ , @ k } of the parameter space 0, where 

®fc = ©\UfcLi{^fc} i s tri e set of all currently unrepresented parameters. By def- 
inition of the Dirichlet process, Go has the following distribution on this finite 
partition: 

(G o (0i), . . . , G (e R ), G (e- k )) I J,H~ Dir( 7 F ■ ■ ■ , 7#(%), jH(@~ k )) 
(C.13) ~EHr(0,...,0, 7 ), 



where we have used the fact that H is absolutely continuous with respect to Lebesgue 
measure. 
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For every currently instantiated table t, the considered dish assignment variable 
kjt associates the table-specific considered dish 9* t with one among the unique 
set of dishes {6\, . . . , 6^}. Recalling that fhjk denotes how many of the tables in 
restaurant j considered dish 9k, we see that we have fh.k observations 9* t ~ Go in 
the franchise that fall within the single-element cell {9k}. By the properties of the 
Dirichlet distribution, the posterior of Go is 



(C.14) (G o (0i), . . . , G (%), Go(e^))|0*,7 ~ Dir(m. l5 . . . ,fh. R ,j). 

Since (G (6>i), . . . , G (9 K ), G (6^)) is by definition equal to (ft, . . . ,ftf,ft), 
and from the conditional independencies illustrated in Fig. 16, the desired posterior 

of j3 is 



where here we define ft; = Ylfc=K+i ftk- From the above, we see that {m.k}k=i 
is a set of sufficient statistics for resampling /3 defined on this partition. Thus, it is 
sufficient to sample fhjk instead of tji and kjt, when given the state index zt. The 
sampling of fhjk, as well as the resampling of hyperparameters (see Supplementary 
Material E), is greatly simplified by additionally sampling auxiliary variables rrijk 
and Wjt, corresponding to the number of tables in restaurant j that were served 
dish k and the corresponding override variables. 

C.1.3. Jointly Sampling rrijk, Wjt, and fhjk- We jointly sample the auxiliary 
variables rrijk, Wju and fhjk from 

(C.16) p(m, w,fh | Zi : T, ft a, n) = p(fh | m, w, z± : t, ft a, k) 



We start by examining p(ra \ z± : t, ft, a, k). Having the state index assignments 
zi-t effectively partitions the data (customers) into both restaurants and dishes, 
though the table assignments are unknown since multiple tables can be served 
the same dish. Thus, sampling rrijk i s m effect equivalent to sampling table as- 
signments for each customer after knowing the dish assignment. This conditional 
distribution given by: 

pitji = t | kj t = k, t" j \ k~ jt ,y 1:T , ft a, «) 



(C.15) 



(ft, ... , ftf, ft) | t, k, k, w, yi-.r, 7 ~ Dir(m.i, . . . , m. R ,i) 



p(w | m, zi-.t, /3,a, K)p(m \ z\-t, (3, a, k). 




tj T . , a, n)p{kj t = k \ ft a, «) 



(C.17) 
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where n-^ 1 is the number of customers sitting at table t in restaurant j, not count- 
ing yji. Similarly, are the table assignments for all customers except and 
k~ 3t are the dish assignments for all tables except table t in restaurant j. We re- 
call that Tj is the number of currently occupied tables in restaurant j. The form 
of Eq. (C.17) implies that a customer's table assignment conditioned on a dish 
assignment k follows a DP with concentration parameter a(3 k + nd(k,j). That is, 

tji | k jtji = k,t- ji ,k~ jt ^,y 1:T ,/3,a,K^ it', j? ~ GEM(a(3 k + KS(k,j)). 

For Z{ ~ /3 and /? ~ GEM(7) Antoniak (1974) provides a form for the distribu- 
tion of the number of unique values of Zi resulting from N draws from the measure 
/3. Letting K be the number of unique values of {z\, . . . , zjy}, this distribution is 
given by: 

(C.18) p(K | 2V, 7) = f WL- 8 (N,K)T K , 

where s(n,m) are unsigned Stirling numbers of the first kind.Then, Eq. (C.18) 
provides the form for the distribution over the number of unique components (i.e., 
tables) generated by sampling njk times from this stick-breaking distributed mea- 
sure, where we note that for the HDP-HMM nj^ is the number of customers in 
restaurant j eating dish k: 



(C.19) p(m jk = m\ n jk ,f3,a,K) 

r(a/3 k + K 6(k,j)) 



■s(n jk ,m)(a/3 k + K5(kJ)) r 



r(a/3 fc + KS(k,j) + n jk 

For large ?ij k , it is often more efficient to sample rrij k by simulating the table as- 
signments of the Chinese restaurant, as described by Eq. (C.17), rather than having 
to compute a large array of Stirling numbers. 

We now derive the conditional distribution for the override variables Wjt- The 
table counts provide that m,j k tables are serving dish k in restaurant j. If k 7^ j, we 
automatically have nij k tables with wj t = since the served dish is not the house 
specialty. See Fig. 17. Otherwise, for each of the rrijj tables t serving dish kj t = j, 
we start by assuming we know the considered dish index kj t , from which inference 
of the override parameter is trivial. We then marginalize over all possible values of 
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(a) Generative (b) Inference 



FIG 17. (a) Generative model of considered dish indices kjt (top) being converted to served dish 
indices kjt (bottom) via override variables Wjt- (b) Perspective from the point of view of an inference 
algorithm that must infer kjt and Wjt given kjt- If kjt 7^ j, then the override variable Wjt is 
automatically implying that kjt = kjt, as indicated by the jagged arrow. If instead kjt = j, then 
this could have arisen from the considered dish kjt being overridden (Wjt = 1) or not (Wjt = 0), 
These scenarios are indicated by the dashed arrow. If the considered dish was not overridden, then 
kjt = kjt = j. Otherwise, kjt could have taken any value, as denoted by the question mark. 



this index: 

kjt = j, P, p) 

K 

= ^2 p(k jt ,w jt I kjt = j, P) + p(kjt = K + 1, w jt I k jt = j, (3) 
k jt =i 

K 

oc ^2 P( k it = J I kjt,Wjt)p{kj t I P)p{w jt I p) 
k jt =i 

+ P(kjt = j I kjt = K + l,Wjt)p(kj t =K + 1\ P)p{wj t I p) 
f PjO- - p), w jt = 0; 

\ P, w jt = 1, 

where p = is the prior probability that Wjt = 1. This distribution implies that 
having observed a served dish kj t = j makes it more likely that the considered 
dish kjt was overridden via choosing Wjt = 1 than the prior suggests. This is 
justified by the fact that if Wjt = 1, the considered dish kjt could have taken any 
value and the served dish would still be kjt = j- The only other explanation of 
the observation kjt = j is that the dish was not overridden, namely Wjt = 
occurring with prior probability (1 — p), and the table considered a dish kjt = j, 
occurring with probability j3j. These events are independent, resulting in the above 
distribution. We draw rrijj i.i.d. samples of Wjt from Eq. (C.20), with the total 
number of dish overrides in restaurant j given by Wj. = Y^t w jt- The sum of these 
Bernoulli random variables results in a binomial random variable. 



p(wjt 



(C.20) 
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Given rrijk for all j and k and Wjt for each of these instantiated tables, we can 
now deterministically compute fhjk, the number of tables that considered ordering 
dish k in restaurant j. Any table that was overridden is an uninformative observa- 



Note that we are able to subtract off the sum of the override variables within a 
restaurant, Wj., since the only time Wjt = 1 is if table t is served dish j. From 
Eq. (C.21), we see that K = K. 

The resulting direct assignment Gibbs sampler is outlined in Algorithm 1 . 

C.2. Sticky HDP-HMM with DP emissions. In this section we derive the 
predictive distribution of the augmented state (zt,st) of the sticky HDP-HMM 
with DP emissions. We use the chain rule to write: 

(C.22) p(z t = k,s t = j | z\ t , s\ t ,yi; T , P, a, a, k, A) 

= P(st = j | zt = k, z\ t , s\ t ,yi;T, cr, A) 

p(zt = k | z\ t , s\ t , yi-.r, P, a, «, A). 

We can examine each term of this distribution by once again considering the joint 
distribution over all random variables in the model and then integrating over the 
appropriate parameters. For the conditional distribution of z t = k when not given 
St, this amounts to: 

p(zt = k | z\ t , s\ t , yi-.r, P, a, k, A) oc / ~[p(nj \ a, P, k) JJpOt I n^Jdir 



tion for the posterior of so that 



(C.21) 





(C.23) 



oip(z t = k | z\ t ,P,a,K) 




p{yt I {y T \ z T = k, s t ,r / 1}, A). 
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Given the previous state assignments z[ n T and global transition distribution /3'™ 1 ' : 

1. Set zi-.T = z^t 11 and /3 = /3 {n ~ 1} . For each t €{!,..., T}, sequentially 

(a) Decrement n Zt _ lzt and n z - tZt+1 and remove y t from the cached statistics for the 
current assignment zt = k: 

(/tfejEfc) Quit,!]*;) 9 S/t 

£fc «- £fc - 1 

(b) For each of the K currently instantiated states, determine 

/fc(yt) = (aPfe +n 2t _ 1 * : ) — U k (y t ;nh,^k) 

\ a + rifc. + « / 

for Zt-i 7^ A:, otherwise see Eq. (C.10). Also determine probability fK+i(yt) of a 

new state K + 1. 

(c) Sample the new state assignment zt: 

K 

If 2t = A" + 1, then increment A and transform /3 as follows. Sample b ~ Beta(l, 7) 
and assign /3 K <- b/3- k and /3 S <- (1 - &)/%, where /3 S = Y.T=k+i Pk- 

(d) Increment n Zt _ lZt and n ztZt+1 and add y t to the cached statistics for the new 
assignment zt = k: 

fJ&fc.Efc) <- (Afc, S fc ) © y t 
i>fc ^ — i> fc + 1 

2. Fix z[ n ^, = 2i:T- If there exists a j such that rij. =0 and n.j = 0, remove j and decrement 
K. 

3. Sample auxiliary variables m, io, and m as follows: 

(a) For each (j,k) £ {1, . . . , K} 2 , set m.^ = and n — 0. For each customer in 
restaurant j eating dish k, that is for i = 1, . . . , rijfc, sample 

\n + a/3 k + Kd(j, k) i 
Increment n, and if x = 1 increment mjfc. 

(b) For each j £ {1, . . . , A"}, sample the number of override variables in restaurant j: 

P 



w r ~ Binomial m }j , 

V P + Pi0--P) 
Set the number of informative tables in restaurant j considering dish k to: 

mjk, j k; 

m,jj-Wj., j = k. 

4. Sample the global transition distribution from 

|8 (n) ~ Dir(m.i, . . . ,m. K ,~/) 

5. Optionally, resample the hyperparameters 7, a, and k as described in Supplementary 
Material E. 



Algorithm 1: Direct assignment Rao-Blackwellized Gibbs sampler for the sticky HDP- 
HMM. The algorithm for the HDP-HMM follows directly by setting k = 0. Here, we 
assume Gaussian observations with a normal-inverse- Wishart prior on the parameters of 
these distributions (see Supplementary Material C). The © and operators update cached 
mean and covariance statistics as assignments are added or removed from a given compo- 
nent. 
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The term p(z t = k \ z\ t , f3, a, k) is as in Eq. (C.10), while 



(C.24) p(s t = j | {s T | z T = k,T / t}, a) 



= < 




je{l,...,K' k }; 



which is the predictive distribution of the indicator random variables of the DP 
mixture model associated with z t = k. Here, is the number of observations y T 
with [z T = k,s T = j) for r ^ t, and K' k is the number of currently instantiated 
mixture components for the k th emission density. 

We similarly derive the conditional distribution of an assignment s t = j given 
zt = k as: 



P(s t = j | z t = k, z\ t , s\ t , yi-.r, cr, A) oc p(s t = j \ {s T \ z T = k,r / t}, a) 



is derived in the same fashion as Eq. (C.12) where now we only consider the obser- 
vations y T that are assigned to HDP-HMM state z T = k and mixture component 

Sq- — k . 

The direct assignment Gibbs sampler for the sticky HDP-HMM with DP emis- 
sions is outlined in Algorithm 2. 



(C.25) 



V{Vt I {Vr \ z T = k,s t = j, T / t}, A). 



The likelihood component of these distributions, 



P{yt | {y T I z T = k,s t = j, t ^ t}, A) 
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Given a previous set of augmented state assignments (z[" T , Sj" T ) and the global transition 
distribution j3 {n ~ l) : 

1. Set (zut, si :T ) = (z^t^ > s&r^ ) and Z 3 = /3 ( " _1) • For each f G 

(a) Decrement n Zt _ lZt , n ztZt+1 , and ro' ztst and remove yt from the cached statistics for 
the current assignment (zt, St) — (k,j): 

(fik,j,£k,j) <- (p,k,i,t k ,j) Qyt 

(>h,i «- f>k,j - 1 

(b) For each of the K currently instantiated HDP-HMM states, compute 

i. The predictive conditional distribution for each of the K' k currently instantiated 
mixture components associated with this HDP-HMM state 

f'k,j(yt) = [ *\ ) to Kj {y t ; Afe,j , iSfc,j) 
\u -f n k . j 

and for a new mixture component K' k + 1 

fk,K'+i(yt) = — rzr t{> o(yf> £o,£o). 

fc' 

ii. The predictive conditional distribution of the HDP-HMM state without 
knowledge of the current mixture component 

f i \ la, ,fa/3 Zt+1 +n kz +n5(k,zt+i)\ I ^ , , \ 

/fe(2/t) = (a/3 fe + n^_ lfc ) ^ ± - ±- + ^ J I 2^ f k ,j{vt) + fk,K' k +i(vt) I 

for z t -i 7^ fc, otherwise see Supplementary Material C.2. Repeat this procedure 
for a new HDP-HMM state A + 1 with K' K+1 initialized to 0. 

(c) Sample the new augmented state assignment (zt, s t ) by first sampling z t : 

K 

* ~ ^/ fc (y t )^t,fc) + /A-+i(yt)<5(^,A' + l). 
fc=i 

Then, conditioned on a new assignment z± = fc, sample st: 

«"* 

s« ~ ^2f'k,j(yt)S(s t ,j) + fL K > k+1 (yt)5{s t , K' k + 1). 
j=i 

If fc = A + 1, then increment A and transform /3 as follows. Sample b ~ Beta(l, 7) 
and assign /?a" <— f>/3j. and <— (1 — &)/%. If s t = A'£ + 1, then increment K' k . 

(d) Increment n zt _ izt , n ztZt+1 , and 7i'~ tst and add yt to the cached statistics for the new 
assignment (z t ,s t ) = (fc, j): 

(Afej.Sfc.i) <- (Afe,j,s fc ,j) 2/ t 

2. Fix (z^, «i"t) = (2i:T, su). If there exists a fc such that rik- = and n.k = 0, remove fc 
and decrement A'. Similarly, if there is a (fc, j) such that n' k j = then remove j and 
decrement Aj[. . 

3. Sample auxiliary variables m, w, and m as in step 3 of Algorithm 1. 

4. Sample the global transition distribution as in step 4 of Algorithm 1. 

5. Optionally, resample the hyperparameters a, 7, a, and k as described in Supplementary 
Material E. 



Algorithm 2: Direct assignment Rao-Blackwellized Gibbs sampler for the sticky HDP- 
HMM with DP emissions. 
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APPENDIX D: BLOCKED SAMPLER 

In this section, we present the derivation of the blocked Gibbs samplers for the 
sticky HDP-HMM and sticky HDP-HMM with DP emissions. The resulting Gibbs 
samplers are outlined in Algorithms 3 and 4. 

D.l. Sampling (3, 7r, and tp. The order L weak limit approximation to the DP 
gives us the following form for the prior distribution on the global weights (3: 

(D.l) /?| 7 ~Dir(7/X,...,7/X). 

On this partition, the prior distribution over the transition probabilities is Dirichlet 
with parametrization: 

(D.2) iTj I a, k, f3 ~ Dir(a/3i, . . . , a(3j + k,. . . , a^L,)- 

The posterior distributions are then given by: 

(D.3) /3 | m, 7 ~ Dir(7/L + 771.1, . . . , 7/L + m. L ) 

ttj I zi-.t, a,P ~ Dir(a/3i + nji, . . . , a(3j + k + njj, . . . , afti + rij L ), 

where we recall that rijk is the number of j to k transitions in the state sequence 
z\ : t and fhjk is the number of tables in restaurant j that considered dish k. The 
sampling of the auxiliary variables fhjk is as in Supplementary Material C. 

For the sticky HDP-HMM with DP emissions, an order L' weak limit approx- 
imation to the DP prior on the emission parameters yields the following posterior 
distribution on the mixture weights V>fc : 

(D.4) ip k I z 1:T ,s 1:T ,a ~Dir(a/L' + n' kl ,. . . ,a/L' + n kL ,), 

where n' ke is the number of observations assigned to the £ th mixture component of 
the k th HMM state. 

D.2. Sampling zi;T for the Sticky HDP-HMM. To derive the forward-backward 
procedure for jointly sampling z\-t given y± : T for the sticky HDP-HMM, we first 
note that 

P(Z1:T I yi:T,TT,0) = P(z T \ Z T -1, Vl-.T,™ , 0)p(z T -l \ Z T -2, Vl:T, 7T, 0) 

■ ■■p(z2 | z 1 ,yi :T ,Tv,9)p{zi \ yv. T ,^,0). 

Thus, we may first sample z\ from p(z± \ yi-.r, f, P, 9), then condition on this 
value to sample Z2 from p(z2 \ z\,y\-T, tt, 0), and so on. The conditional distribu- 
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tion of z\ is derived as: 

P(Z1 I Vl:T,Tr,0) OC p(z 1 )f(y 1 \ d zl )^2Y]_p{z t \ K Z t-l)f{Vt I &z t ) 

Z 2 ;T t 

°zp(zi)f(yi | 6 Zl )^ j p(z 2 I vr Zl )/(y 2 | Z2 )m^ 2 (.z 2 ) 

Z2 

(D-5) (xp(z 1 )f(y 1 | 2l )m 2 ,iOi), 

where m t j-i{zt-i) is the backward message passed from zt to zt-i and for an 
HMM is recursively defined by: 

i, t — r + lj 

(D.6) otp(y t]T j z t _i,7r,0). 

The general conditional distribution of z t is: 

(D.7) p(z t | zt-i,yi:T,n,Q) ocp(z t | n Zt _ 1 )f(y t \ 6 Zt )m t+ i,t(zt)- 
The resulting blocked Gibbs sampler is outlined in Algorithm 3. 

D.3. Sampling (z 1:T , s 1:T ) for the Sticky HDP-HMM with DP emissions. 

We now examine how to sample the augmented state (z t , s t ) of the sticky HDP- 
HMM with DP emissions. The conditional distribution of (zt, s t ) for the forward- 
backward procedure is derived as: 

(D.8) 

p{z t ,S t | Z t -l,yi:T,-K,1p,0) Otp(z t \ TT^^St \ tp Zt )f(yt I z t ,s t )m t +l,t( Z t) ■ 

Since the Markovian structure is only on the zt component of the augmented state, 
the backward message mt,t-i(zt~i) from (zt, st) to (zt-i, sj-i) is solely a func- 
tion of Zt-i. These messages are given by: 

(D.9) m t ,t-i(z t -i) 

Ylz t E St P( z t I ^z t -i)p(st I ipz t )f(yt | 9 ZuSt )m t+u {z t ), t < T; 
1, t = T + l. 



oc 



More specifically, since each component j of the k th state-specific emission distri- 
bution is a Gaussian with parameters 9jk = {^k,j, ^k,j}> we have: 

(D.10) p(z t = k,s t = j | zt-i,yi:T,ir,i>,0) 

^ 7Ta t _i (fyipkUWiVf, Vk,j, ^k,j)m t +i,t{ k ) 
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Given a previous set of state-specific transition probabilities 7r' n 1 ', the global transition 
distribution /3 t - n ^ 1 \ and emission parameters l,n ^ 1 " 1 : 

1. Set 7r = 7r' n_1 ' and 6 = 0' n_1 \ Working sequentially backwards in time, calculate 
messages m t ,t-i(fc) : 

(a) For each k € {1, . . . , L}, initialize messages to 

m T +i,T(fc) = 1 

(b) For each t 6 {T — 1, . . . , 1} and for each k 6 {1, . . . , L}, compute 

L 

m t ,t-\ ; Hj,T,j)mt+i,t(J) 

3=1 

2. Sample state assignments z\-t working sequentially forward in time, starting with riju = 
and y k = for each (j, fe) 6 {1, . . . , L} 2 : 

(a) For each k £ {1, . . . , L}, compute the probability 

fk(yt) = n Zt _ 1 (k)Af(y t ; (ih, Sfe)m t +i,t(fc) 

(b) Sample a state assignment z t : 

L 

Zt ~^2f k (yt)3(z t ,k) 
fe=i 

(c) Increment n zt _ izt and add y t to the cached statistics for the new assignment Zt = k: 

y k <- yk e y t 

3. Sample the auxiliary variables m, w, and m as in step 3 of Algorithm 1. 

4. Update the global transition distribution by sampling 

/8 ~ Dir(7/L + m.i, . . . ,j/L + fh. L ) 

5. For each k £ {1, . . . , L}, sample a new transition distribution and emission parameter 
based on the sampled state assignments 

7r fe ~ Dir(a/3i +n kl ,..., a/3 k + ft + n kk , a/3 L + n kL ) 

6 k ~ p(fl|A,^b) 
See Supplementary Material D.4.1 for details on resampling 8 k . 

6. Fix 7r (n) = 7T, /3 (n) = /3, and = 0. 

7. Optionally, resample the hyperparameters 7, a, and k as described in Supplementary 
Material E. 



Algorithm 3: Blocked Gibbs sampler for the sticky HDP-HMM. The algorithm for the 
original HDP-HMM follows directly by setting n = 0. Here, we assume Gaussian ob- 
servations with an independent Gaussian prior on the mean and inverse- Wishart prior on 
the covariance (see Supplementary Material D.4.1). The set 34 is comprised of the statis- 
tics obtained from the observations assigned to state k that are necessary for updating the 
parameter 6 k = {fi k , The © operator updates these cached statistics as a new assign- 
ment is made. 
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(D.ll) 



m t +i,t{k) 



L V 

i=l 1=1 



(D.12) m T+1:T (k) = l k = l,...,L. 

D.4. Sampling 9. Depending on the form of the emission distribution and 
base measure on the parameter space 0, we sample parameters for each of the 
currently instantiated states from the updated posterior distribution. For the sticky 
HDP-HMM, this distribution is: 



For the sticky HDP-HMM with DP emissions, the posterior distribution for each 
Gaussian's mean and covariance, 9k j, is determined by the observations assigned 
to this component, namely, 



The resulting blocked Gibbs sampler for sticky HDP-HMM with DP emissions 
is outlined in Algorithm 4. 

D.4.1. Non-Conjugate Base Measures. Since the blocked sampler instantiates 
the parameters 9k, rather than marginalizing them as in the direct assignment sam- 
pler, we can place a non-conjugate base measure on the parameter space 6. Take, 
for example, the case of single Gaussian emission distributions where the parame- 
ters are the means and covariances of these distributions. Here, 0% = {fi^, In 
this situation, one may place a Gaussian prior Af(fio, So) on the mean and an 
inverse-Wishart A) prior on the covariance S^. 

At any given iteration of the sampler, there is a set of observations Yk = {yt \ 
zt = k} with cardinality \Yk\. The posterior distributions over the mean and co- 
variance parameters are: 



(D.13) 



0j | z 1:T , yv.T, \~p{9\ {y t | zt = j}, A). 



(D.14) 



0k,j I Zl:T, Sl-.T, Vl:T, A ~ p(9 \ {y t \ (z t =k,S t = j)}, A). 



(D.15) 



Sfc I Mfe 
Mfc I s fc 



TW(v k A k , v k ) 



where 




t& k 



(s^ + ims- 1 ) 




THE STICKY HDP-HMM 



61 



Given a previous set of state-specific transition probabilities 7r' n 1 \ emission mixture weights 
i/'' n_1 ', global transition distribution /3 l,n ^ 1 \ and emission parameters 0' n ~ 1 '; 

1. Set 7T = 7r (n_1) , ip = V (n_1) an d 9 = (n_1) . Working sequentially backwards in time, 
calculate messages mt,t-i(k) : 

(a) For each k 6 {1, . . . , L}, initialize messages to 

m T +i,T(fc) = 1 

(b) For each t £ {T — 1, . . . , 1} and for each k £ {1, . . . , L}, compute 

L L' 

m t ,t-i(k) = ^2^2 Kk{i)i)i{t)N{yt+r, m,£, )m t+ i, t (i) 

i=l €=1 

2. Sample augmented state assignments (zi : t, si : t) working sequentially forward in time. 
Start with n ifc = 0, n' kj = 0, and y kJ = for (i, k) £ {1, ... , L} 2 and 
(fc,j)£{l,...,L}x{l,...,L'}. 

(a) For each (k, j) £ {1, . . . , L} x {1, . . . , L'}, compute the probability 

fk,j(yt) = 7T Zt _ 1 (fc)V) fc (j)7V(yi;^ fc j,S fc , j )mi + i,i(A:) 

(b) Sample an augmented state assignment (zt,st): 

L L' 

{z t ,s t ) ~^2^2f k ,j(yt)6(zt,k)5(st,j) 

k=l j=l 

(c) Increment n zt _ izt and n' ztst and add yt to the cached statistics for the new 
assignment (z t , St) = (A:, j): 

3. Sample the auxiliary variables m, w, and m as in step 3 of Algorithm 1. 

4. Update the global transition distribution ft as in step 4 of Algorithm 3. 

5. For each k £ {1, . . . , L}, 

(a) Sample a new transition distribution n k and emission mixture weights ipk'. 

Tv k ~ Dir(a/3i + n fc i, . . . , a/3 k + k + n kk , ■ ■ ■ , a/3 L + n kL ) 
tpk ~ Dk(a/L' + n' kl , . . . ,a/L' + n' kL ,) 

(b) For each j £ {1, . . . , L'}, sample the parameters associated with the j mixture 
component of the k th emission distribution: 

9h,i ~ p(o\\y k ,i) 

See Supplementary Material D.4.1 for details on resampling 8 k j. 

6. Fix 7T (n) = 7T, i/> (,l) = ip, = P, and 6» (,I) = 0. 

7. Optionally, resample the hyperparameters a, 7, a, and «; as described in Supplementary 
Material E. 



Algorithm 4: Blocked Gibbs sampler for the sticky HDP-HMM with DP emissions. 
Here, we use an independent Gaussian prior on the mean and inverse- Wishart prior on the 
covariance (see Supplementary Material D.4.1). The set y~k,j is comprised of the statistics 
obtained from the observations assigned to augmented state (k,j) that are necessary for 
updating the parameter 6t j = {/J>k,j , Sfcj }■ The © operator updates these cached statistics 
as a new assignment is made. 
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The sampler alternates between sampling fj, k given £& and T, k given \x k several 
times before moving on to the next stage in the sampling algorithm. The equations 
for the sticky HDP-HMM with DP emissions follow directly by considering Y k j = 
{y t | Zt = k,s t = j] when resampling parameter 9 k j = {fi kij , S fej -}. 
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APPENDIX E: HYPERPARAMETERS 

In this section we present the derivations of the conditional distributions for the 
hyperparameters of the sticky HDP-HMM. These hyperparameters include a, k, 
7, a, and A, where A is considered fixed. Many of these derivations follow directly 
from those presented in Escobar and West (1995) and Teh et al. (2006). 

We parameterize our model by [a + k) and p = n/(a + k); this simplifies the 
resulting sampler. We place Gamma(a, b) priors on each of the concentration pa- 
rameters (a + k), 7, and a, and a Beta(c, d) prior on p. The a and b parameters of 
the gamma hyperprior may differ for each of the concentration parameters. In the 
following sections, we derive the resulting posterior distribution of these hyperpa- 
rameters. 

E.l. Posterior of (a + k). Let us assume that there are J restaurants in the 
franchise at a given iteration of the sampler. Note that for the HDP-HMM, J = K. 
As depicted in Fig. 16(b), the generative model dictates that for each restaurant 
j we have ttj ~ GEM(a + k), and a table assignment is determined for each 
customer by tji ~ ttj. In total there are rij. draws from this stick-breaking measure 
over table assignments resulting in rrij. unique tables. By Eq. (C.18) and using the 
fact that the restaurants are mutually conditionally independent, we may write: 

p(a + k | mi., . . . ,mj.,ni., . . . ,nj.) 

oc p(a + K)p(mi. , mj. \ a + k, n\. , . . . , n j. ) 
J 

oc p(a + k) JJp(m-j. | a + k, rij.) 



x p(a + k) M s(nj., mj . a + K ) m r — — ■ ■ 

f-^r T(a + K + nj.) 



(E.1) K1<a + « Ka + Br . n _^_J_ 

j = l J 



Using the fact that the gamma function has the property T(z + 1) = zT(z) and 
is related to the beta function via /3(x, y) = T(x)T(y)/T(x + y), we rewrite this 
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distribution as 

p(a + k | mi.,. . . ,mj.,ni.,. . . ,nj.) 



, \ f ™ tt ( a + K + nj.)/3(a + k + l,rij.) 
oc p(o + + A)- II * (a + ^n,-.) " 



(E.2) = p(a + «)(a + k)" 1 ' + -^-^ jf rf^l - /;,•)"-• ',/,> 

where the second equality arises from the fact that (3(x, y) = Q - ty^dt. 

We introduce a set of auxiliary random variables r = {ri, . . . ,rj}, where each 
rj G [0, 1]. Now, we augment the posterior with these auxiliary variables as fol- 
lows: 

p(a + K,r\ mi.,. . . ,mj.,ni.,. . . ,nj.) 

J / N 

\n,. — 1 



oc p(a + «)(q + " TT (l + r" +K (l - r,)^' 

-M V a + re/ J 

oc (a + K )«H^..-i e -(a+«)6 TT ( x + r «+«n _ r 

-Mr \ a + k J J 



3=1 



(E.3) = (a + K )»+"^--i e -(«+«)6 J j j r «+*(i _ rj)^ - 1 

j=is j e{o,i} v 7 

Here, we have used the fact that we placed a Gamma(a, b) prior on (a + k). We 
add another set of auxiliary variables s = {s\, . . . , sj}, with each Sj 6 {0, 1}, to 
further simplify this distribution. The joint distribution over (a + k), r, and s is 
given by 

(E.4) p(a + K,r, s | mi., . . . , mj., m., . . . , nj.) 

J 



oc (a + K )«+t»—i e -(«+«)6 JJ j _±_ j r «+«(i _ rj .)»i--i. 



i=i 
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Each conditional distribution is as follows: 

p(a + k | r,s,mi.,. . . ,mj.,ri\., . . .,nj.) 

oc {a + K) a+m - 1 

= Gamma(a + m, 
p(rj | a + K,r\j, s,m!., . . . ,mj.,n h , . . . ,nj.) oc 

p(sj | a + k, r, s\j , mi., . . . , mj. , m. , . . . , nj.) oc 



E.2. Posterior of 7. We may similarly derive the conditional distribution of 
7. The generative model depicted in Fig. 16(b) dictates that /3 ~ GEM(7) and 
that each table t considers ordering a dish kj t ~ /3. From Eq. (C.21), we see 
that the sampled value fhj. represents the total number of tables in restaurant j 
where the considered dish kj t was the served dish kj t (i.e., the number of tables 
with considered dishes that were not overridden.) Thus, m.. is the total number 
of informative draws from /3. If K is the number of unique served dishes, which 
can be inferred from zi-t, then the number of unique considered dishes at the 
informative tables is: 

K K 

(E.5) K = ^2 1 ( 7fl -fe > °) = K ~Y1 l ^- k = and 7Ukk > °)' 

k=l k=l 

We use the notation 1(A) to represent an indicator random variable that is 1 if 
the event A occurs and otherwise. The only case where K is not equivalent to 
K is if every instance of a served dish k arose from an override in restaurant k 
and this dish was never considered in any other restaurant. That is, there were no 
informative considerations of dish k, implying m.k = 0, while dish k was served 
in restaurant k, implying > so that k is counted in K. This is equivalent 
to counting how many dishes k had an informative table consider ordering dish k, 
regardless of the restaurant. We may now use Eq. (C.18) to form the conditional 



-£/=i «j e -(«+«)(6-E/=i lo s^) 



„a+K 



m.—l 



r J +K (l- rj ) 
Beta(a + k + 1, nj.) 



a + k 



Ber 



n ; 



n ■ 4- rv 4- k 



66 



E. FOX ET AL. 



distribution on 7: 

p(7 I K,fh..) oc p(pj)p{K | 7,777..) 

oc p{^)s{m..^K)^ 



k r( 7 ) 



r(7 + m..) 

oc ^(7)7 



^ (7 + m..)/3(7 + l,m..) 



7r(m..) 

(E.6) a p(7)7 A ' _1 (7 + ?ti-) / ^{l - rf)™ — l dq. 



As before, we introduce an auxiliary random variable 77 € [0, 1] so that the joint 
distribution over 7 and 77 can be written as 

p(7, 77 I K, fh..) oc p(7)7 A '~ 1 (7 + ?fi..)77 7 (l — 77) m " _1 
(E.7) oc 7 a+/? - 2 ( 7 + m..)e- 7(b - logr ' ) (l - 77)^— \ 

Here, we have used the fact that there is a Gamma(o, b) prior on 7. We may add an 
indicator random variable ( £ {0, 1} as we did in Eq. (E.4), such that 

p(j,V,C I K,fh..) oc 7«+^-! ^Ve-T^-iog'?) (i-^—i. 

The resulting conditional distributions are given by: 

p(7 I r],(,K,m..) oc 7 <H-ir-i-<; e -7(&-i°g»?) 

= Gamma(a + .K' — 6 — log 77) 
73(77 I 7, £, -K", "7,-) oc 77 7 (1 — 77)*" " _1 = Beta(7 + 1, 777,..) 

fh..\^ ( fa.. 



(E.8) p(C I 7, 77, if, m..) oc — 1 = Ber 

V 7 / \m.. + 7, 

Alternatively, we can directly identify Eq (E.7) as leading to a conditional distribu- 
tion on 7 that is a simple mixture of two Gamma distributions: 

p(7 \v,K,m..) oc j a+R - 2 { 1 + fh..)e-^ b - losri) 

(E.9) oc 7TmGamma(a + K,b — log 77) 

+(1 — 7Tm)Gamma(a + K — 1, b — log 77) 

(E.10) p(7? I j,K,fh..) oc 77 7 (1 - 77)™' " 1 = Beta(7 + 1,777..), 

where 

q + if - 1 
m 777.. (b — log 77) 

The distribution in Eq. (E.3) would lead to a much more complicated mixture of 
Gamma distributions. The addition of auxiliary variables Sj greatly simplifies the 
interpretation of the distribution. 
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E.3. Posterior of cr. The derivation of the conditional distribution on a is 
similar to that of (a + k) in that we have J distributions ipj ~ GEM(er). The state- 
specific mixture component index is generated as st ~ tp Zt implying that we have 
rij. total draws from ipj, one for each occurrence of zt = j. Let K'- be the number 
of unique mixture components associated with these draws from ipj. Then, after 
adding auxiliary variables r' and s', the conditional distributions of a and these 
auxiliary variables are: 

p(a | r',s',K[., 

P{r'j | v, r '\j, s ', K l, 
p{s'j | a,r' ,s'^,K[., 

In practice, it is useful to alternate between sampling the auxiliary variables 
and concentration parameters a, 7, and a for several iterations before moving to 
sampling the other variables of this model. 



...,Kj.,m.,...,nj.] 



oc (a) a+K! ■" 1 -S/=i •J e -W(6-E/=i i°g^) 
— 



E.4. Posterior of p. Finally, we derive the conditional distribution of p. We 
have m.. = Ylk m -k tota l draws of Wj t ~ Ber(p), with J2j w j- tne number of 
Bernoulli successes. Here, each success represents a table's considered dish being 
overridden by the house specialty dish. Using these facts, and the Beta(c, d) prior 
on p, we have 

p(p I w) oc p(w j p)p(p) 
m.. 



22j w r J r ( c ) r W 



a p E, tiy.+c-l^ _ /0 )m..-E J - toj.+d-l 
(E.ll) oc Beta | u>j. + c, m.. — ^ Wj. + 
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